Abstract The aim of this paper is to extend the family of Glushkov automata. This is
achieved by designing new operators, the so-called multi-tilde-bar operators, that allow us to 
compute Glushkov functions for the associated extended expressions. Conversely an extended 
Glushkov automaton with n + 1 states can be converted into an extended expression with 
n occurrences of symbols. It leads to a characterization in terms of graphs of the family of 
extended Glushkov automata. Moreover, extended expressions are shown to be superpoly- 
nomially more succinct than standard expressions. 


1 Introduction 


Converting finite automata into regular expressions and vice versa [19] is one of the most 
fundamental operations in formal languages. The relation between regular expressions and 
Glushkov automata is interesting since it facilitates the conversion from one to another and 
these automata possess a number of properties that make them useful for many practical 
applications [2, 10]. 

A Glushkov automaton can be converted into an equivalent expression of linear size, as
opposed to the exponential size in the general worst case [14]. Hence the following natural
idea: extending regular expressions with operators that are compatible with the structure of
Glushkov automata, with the hope of obtaining more succinct regular expressions.

In this paper, following [20], we present the multi-tilde-bar operators and we show how to 
compute the Glushkov functions of the associated extended expressions. Conversely we show 
that an extended Glushkov automaton with n + 1 states can be converted into an extended 
expression with n occurrences of symbols. As a result the characterization in terms of graphs
of the family of Glushkov automata [9,8] is generalized to the family of extended Glushkov 
automata. Moreover, we exhibit a family of regular languages that proves a superpolynomial 
gap between the succinctness of extended expressions and standard expressions. 

Other families of operators have been investigated, in particular multi-bar operators [6] 
and multi-tilde operators [5]. A slightly different definition of a multi-tilde-bar expression 
has been used in [7] where it is proved that any acyclic n-state automaton can be turned into 
an extended expression of size O(n). 

Section 2 is a preliminary section; it gathers classical notions concerning regular lan- 
guages, regular expressions and finite automata; it also recalls the definition and main prop- 
erties of multi-tilde and multi-bar operators. The multi-tilde-bar operators and the associated 
extended regular expressions are introduced in Sect. 3 as well as a normal form for these 
expressions. Section 4 is devoted to the computation of the Glushkov functions of an extended 
expression while Sect. 5 addresses the conversion of an extended Glushkov automaton into
an extended expression, leading to a characterization of this family of automata. The last 
section investigates the power of factorization of extended expressions. 


2 Preliminaries 
2.1 Languages, regular expressions and automata 


An alphabet is a finite set of symbols. Given an alphabet X, any subset of X* is a language 
over X. The set of regular languages over X is defined as the smallest family of languages 
containing Ø and {a} for every symbol a in X and closed under union, catenation and Kleene 
star. 

A regular expression E over an alphabet X is inductively defined by E = Ø, E = ε,
E = a, E = (F + G), E = (F · G), E = (F*) with a a symbol in X, and F and G two regular
expressions over X. The language denoted by a regular expression is inductively defined by
L(Ø) = Ø, L(ε) = {ε}, L(a) = {a}, L(F + G) = L(F) ∪ L(G), L(F · G) = L(F) · L(G)
and L(F*) = L(F)*, with a a symbol in X, and F and G two regular expressions over X.
By construction, the language denoted by a regular expression is regular.

A finite automaton A is a 5-tuple (X, Q, I, F, δ) where X is an alphabet, Q is a finite set
of states, I ⊆ Q a set of initial states, F ⊆ Q a set of final states and δ ⊆ Q × X × Q a set of
transitions. The set δ can be seen as a function from Q × X to 2^Q defined by q' ∈ δ(q, a) ⇔
(q, a, q') ∈ δ. The domain of the function δ can be extended to 2^Q × X* by setting, for all
Q' ⊆ Q, δ(Q', ε) = Q', δ(Q', a) = ∪_{q ∈ Q'} δ(q, a), δ(Q', a · w) = δ(δ(Q', a), w) for every
word w in X*. A state q ∈ Q is said to be accessible (resp. coaccessible) if there exists
w ∈ X* such that q ∈ δ(I, w) (resp. δ(q, w) ∩ F ≠ Ø). An automaton is said to be accessible
(resp. coaccessible) if all its states are accessible (resp. coaccessible). An automaton is said
to be trim if it is accessible and coaccessible. The language recognized by the automaton A
is the set L(A) = {w ∈ X* | δ(I, w) ∩ F ≠ Ø}. A language L is recognizable if there exists
an automaton that recognizes it.
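
The extended transition function and the acceptance condition can be made concrete with a
short sketch. The Python code below is an illustration only: the dictionary encoding of the
transition set and the names delta_star and recognizes are assumptions introduced here, not
notation from the paper.

```python
def delta_star(delta, states, word):
    """Extended transition function: states reached from `states` by reading `word`.

    `delta` is assumed to map a pair (state, symbol) to a set of states.
    """
    current = set(states)
    for symbol in word:
        current = {q2 for q in current for q2 in delta.get((q, symbol), set())}
    return current


def recognizes(initial, final, delta, word):
    """True if some initial state reaches a final state by reading `word`."""
    return bool(delta_star(delta, initial, word) & set(final))


# Hypothetical automaton for a*b: state 0 loops on 'a' and moves to state 1 on 'b'.
DELTA = {(0, "a"): {0}, (0, "b"): {1}}
```

With I = {0} and F = {1}, the call recognizes({0}, {1}, DELTA, "aab") follows the path
0, 0, 0, 1 and succeeds, whereas "aba" leads to the empty set of states.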

Kleene's theorem [18] asserts that a language is regular if and only if it is recognizable. As
a consequence, for every regular language L, there exists an automaton A and an expression
E such that L = L(E) = L(A). Several algorithms exist to transform an expression into an
automaton [1, 3, 12, 15, 17, 19] and an automaton into an expression [4, 9, 19].

Let E be a regular expression over an alphabet X. The alphabetic width |E| of E is
the number of occurrences of symbols of X appearing in E. We will say width instead of
alphabetic width for convenience. The expression E is said to be trim if either E = Ø
or E contains no occurrence of Ø. The expression E is linear if each of its symbols
occurs exactly once in E. The Glushkov automaton or position automaton [15,19] of a linear
and trim expression E is a (|E| + 1)-state automaton computed from the five following
position functions: Pos(E) = X, Null(E) = {ε} if ε ∈ L(E), Null(E) = Ø otherwise,
First(E) = {x ∈ Pos(E) | ∃w ∈ Pos(E)*, xw ∈ L(E)}, Last(E) = {x ∈ Pos(E) |
∃w ∈ Pos(E)*, wx ∈ L(E)}, and for every position x in Pos(E), Follow(E, x) = {y ∈
Pos(E) | ∃w, w' ∈ Pos(E)*, wxyw' ∈ L(E)}. The position automaton of E is the automaton
denoted by G(E) and defined by the 5-tuple (X, Q, I, F, δ) with Q = Pos(E) ∪ {0}, I = {0},
F = {0} ∪ Last(E) if Null(E) = {ε}, F = Last(E) otherwise, and δ = {(q, q', q') ∈
Q × X × Q | q' ∈ Follow(E, q)} ∪ {(0, q, q) ∈ {0} × X × Q | q ∈ First(E)}. The
automaton G(E) recognizes L(E). Every expression E can be linearized into an expression
E^♯ over an alphabet X^♯ by substituting every occurrence of a symbol by its position (e.g.
(a + b · b* + d)^♯ = 1 + 2 · 3* + 4). Let h be the function from X^♯ to X that maps a
position to its corresponding symbol. The function h is a morphism from the monoid (X^♯)*
to the monoid X*. For a simple regular expression E, h(L(E^♯)) = L(E). This property
characterizes operators that are compatible with the linearization.
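
As an illustration of the position functions, the following Python sketch computes Null,
First, Last and Follow for a linearized simple regular expression and simulates the resulting
position automaton. It uses the classical inductive rules for +, · and *; the tuple encoding
of expressions and the names glushkov and accepts are assumptions made for this sketch.

```python
def glushkov(e):
    """Return (null, first, last, follow) for a linearized expression.

    Expressions are nested tuples (a hypothetical encoding):
    ("sym", p), ("eps",), ("plus", F, G), ("cat", F, G), ("star", F).
    """
    kind = e[0]
    if kind == "sym":
        return False, {e[1]}, {e[1]}, {e[1]: set()}
    if kind == "eps":
        return True, set(), set(), {}
    if kind == "star":
        _, first, last, follow = glushkov(e[1])
        follow = {x: set(s) for x, s in follow.items()}
        for x in last:                      # a last position loops back to First
            follow[x] |= first
        return True, first, last, follow
    n1, f1, l1, fo1 = glushkov(e[1])
    n2, f2, l2, fo2 = glushkov(e[2])
    follow = {x: set(s) for x, s in {**fo1, **fo2}.items()}
    if kind == "plus":
        return n1 or n2, f1 | f2, l1 | l2, follow
    for x in l1:                            # catenation: Last(F) is followed by First(G)
        follow[x] |= f2
    first = f1 | f2 if n1 else set(f1)
    last = l1 | l2 if n2 else set(l2)
    return n1 and n2, first, last, follow


def accepts(e, h, word):
    """Simulate the position automaton of e; h maps positions to symbols."""
    null, first, last, follow = glushkov(e)
    current = {0}
    for a in word:
        current = {p for q in current
                   for p in (first if q == 0 else follow[q]) if h[p] == a}
    return bool(current & (last | ({0} if null else set())))


# The example of the text: (a + b.b* + d) linearized as 1 + 2.3* + 4.
E = ("plus", ("plus", ("sym", 1), ("cat", ("sym", 2), ("star", ("sym", 3)))), ("sym", 4))
H = {1: "a", 2: "b", 3: "b", 4: "d"}
```

On this example the sketch yields First = {1, 2, 4}, Last = {1, 2, 3, 4} and
Follow(2) = Follow(3) = {3}, so that the automaton has |E| + 1 = 5 states.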


2.2 The multi-tilde and multi-bar operators 


Let i and f be two positive integers. The couple (i, f) is said to be growing if i ≤ f. The
set of integers {i, i + 1, ..., f − 1, f} is denoted by [i, f]. The subset of growing couples
of [i, f]² is denoted by [i, f]²_≤.

Let L1, ..., Ln be n nonempty regular languages over X and w be a word in L1 ··· Ln.
A sequence (w1, ..., wn) satisfying w1 ··· wn = w ∧ ∀k ∈ [1, n], wk ∈ Lk is said to be
a split up of w over (L1, ..., Ln). Let (i, f) be a couple of integers in [1, n]²_≤. The word
wi ··· wf is said to be an ε-factor of w if wi ··· wf = ε. If w_{i−1} = ε (resp. w_{f+1} = ε),
the word wi ··· wf admits a left ε-extension (resp. admits a right ε-extension). If the word
wi ··· wf admits no ε-extension, then it is an ε-maximal factor of w.

In the following, totally ordered sets are called lists for convenience. For a list of growing
couples of integers, the lexicographical order is considered: a couple (i, f) is smaller than
a couple (i', f') if either i < i' or (i = i' ∧ f < f'). In this case, we denote this relation by
(i, f) < (i', f'). For a list (E1, E2, ..., En) of regular expressions, we consider the order
of definition Ek < Ek' ⇔ k < k' for every couple (k, k') in [1, n]². For convenience, the list
(E1, ..., En) of expressions is denoted by E_{1,n}. Similarly, a catenation E1 ··· En is denoted
by E_{1···n}.

Let n be a positive integer. The set of finite lists of growing couples in [1, n]² is denoted
by S_n. Let S be a list in S_n. The set of indices of S is the set I_S = [1, #S]. Hence, a list
S in S_n is denoted by S = ((i_k, f_k))_{k ∈ I_S}, with, for every integer k in I_S,
(i_k, f_k) ∈ [1, n]²_≤. We denote by Sup(S) (resp. Inf(S), Succ(s) and Pred(s) for s ∈ S) the
greatest couple in S (resp. the smallest couple in S, the smallest couple in S greater than s
and the greatest couple in S smaller than s) if it exists.

Let (i, f) be a couple in a list S. The couple (i, f) represents the subset [i, f] of [1, n].
The interval [i, f] is the range of (i, f), denoted by Range((i, f)). Two couples (i, f)
and (i', f') are said to overlap if and only if i' < i ≤ f' < f or i < i' ≤ f < f'.
A couple (i, f) is included if there exists a couple (i', f') different from (i, f) such that
i' ≤ i ≤ f ≤ f'. The couple (i', f') is said to include (i, f). A couple (i, f) is overhanging if
it is not overlapped. A list S is free if for all couples (i, f), (i', f') in S such that
(i, f) ≠ (i', f'), [i, f] ∩ [i', f'] = Ø. Every list containing at most one element is by
definition a free list.
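
The relations between couples can be checked mechanically. The following Python sketch is
only an illustration: couples are encoded as pairs of integers, the function names are
assumptions introduced here, and the overlap test assumes that two couples overlap when their
ranges intersect and neither range contains the other.

```python
from itertools import combinations


def growing_couples(n):
    """All couples (i, f) with 1 <= i <= f <= n, in lexicographical order."""
    return [(i, f) for i in range(1, n + 1) for f in range(i, n + 1)]


def overlaps(c, d):
    """The ranges of c and d intersect and neither contains the other."""
    (i, f), (i2, f2) = c, d
    return i2 < i <= f2 < f or i < i2 <= f < f2


def includes(outer, inner):
    """`inner` is included in the distinct couple `outer`."""
    (i2, f2), (i, f) = outer, inner
    return outer != inner and i2 <= i and f <= f2


def is_free(couples):
    """The ranges of any two distinct couples of the list are disjoint."""
    return all(a[1] < b[0] or b[1] < a[0] for a, b in combinations(couples, 2))
```

For instance, (1, 3) and (2, 5) overlap, (2, 3) is included in (1, 5), and the list
((1, 2), (3, 3)) is free whereas ((1, 2), (2, 3)) is not.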
The unary operators tilde, denoted by ~, and bar, denoted by ¯, are defined for every
expression E by L(Ẽ) = L(E) ∪ {ε} and L(Ē) = L(E) \ {ε}. They are extended
to multi-tilde and multi-bar operators, using the following definitions, where n is a positive
integer.

Definition 1 Let (w1, ..., wn) be a split up of a word w over a list of languages (L1, ..., Ln).
Let (i, f) be a couple in [1, n]²_≤. The sequence (w1, ..., wn) satisfies the bar (i, f) if
wi ··· wf ≠ ε.

Definition 2 Let (w1, ..., wn) be a split up of a word w over a list of languages
(L1 ∪ {ε}, ..., Ln ∪ {ε}). Let T be a free list in S_n. The sequence (w1, ..., wn) is generated
by the list T if it holds: wk = ε if k ∈ ∪_{(i,f) ∈ T} [i, f] and wk ∈ Lk otherwise.


Definition 3 Let E_{1,n} be a list of expressions over an alphabet X. Let B be a list in S_n. The
multi-bar E = ¯_B(E_{1,n}) denotes the language

L(E) = {w ∈ X* | there exists a split up of w over (L(E1), ..., L(En))
        satisfying every bar in B}.


Definition 4 Let E_{1,n} be a list of expressions over an alphabet X. Let T be a list in S_n. The
multi-tilde E = ~_T(E_{1,n}) denotes the language

L(E) = {w ∈ X* | there exists a split up of w over (L(E1) ∪ {ε}, ..., L(En) ∪ {ε})
        generated by a free sublist of T}.


3 The multi-tilde-bar operators


Multi-tilde-bar operators are a natural combination of multi-tilde and multi-bar operators. In
fact, bars are used to forbid some combinations of tildes. Consequently, the satisfaction of a
bar by a sequence has to be redefined by adding a list of tildes as a context.


Definition 5 Let E_{1,n} be a list of n regular expressions. Let (w1, ..., wn) be a split up of a
word w over (L(E1) ∪ {ε}, ..., L(En) ∪ {ε}) generated by a free list T in S_n. Let b = (i, f)
be a couple in [1, n]²_≤ \ T. The bar b is said to be satisfied by (w1, ..., wn) w.r.t. T if at
least one of the three following conditions is satisfied:

(1) there exists a couple t in T such that t overlaps b,
(2) there exists a couple t in T such that b is included in t,
(3) wi ··· wf ≠ ε.

According to this definition, the language denoted by a multi-tilde-bar can be expressed as
follows:


Definition 6 Let E_{1,n} be a list of expressions over an alphabet X and L' be the list
(L(E1) ∪ {ε}, ..., L(En) ∪ {ε}) of languages. Let B and T be two lists in S_n such that
B ∩ T = Ø. The multi-tilde-bar E = ~_{T;B}(E_{1,n}) denotes the language

L(E) = {w ∈ X* | there exists a split up of w over L'
        generated by a free sublist T' of T
        satisfying every bar in B w.r.t. T'}.
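
Definitions 5 and 6 can be explored on small cases by exhaustive enumeration. The Python
sketch below is a brute-force illustration restricted to finite languages given as sets of
strings; the function names are assumptions made for this sketch and the overlap test assumes
that two couples overlap when their ranges intersect and neither range contains the other.

```python
from itertools import combinations, product


def overlaps(c, d):
    (i, f), (i2, f2) = c, d
    return i2 < i <= f2 < f or i < i2 <= f < f2


def includes(outer, inner):
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]


def free_sublists(T):
    """Every sublist of T whose couples have pairwise disjoint ranges."""
    for size in range(len(T) + 1):
        for sub in combinations(T, size):
            if all(a[1] < b[0] or b[1] < a[0] for a, b in combinations(sub, 2)):
                yield sub


def bar_satisfied(b, split, tildes):
    """Definition 5: the bar b is satisfied by `split` w.r.t. the free list `tildes`."""
    i, f = b
    return (any(overlaps(t, b) for t in tildes)
            or any(includes(t, b) for t in tildes)
            or "".join(split[i - 1:f]) != "")


def mtb_language(languages, T, B):
    """Definition 6 for a list of finite languages (sets of strings)."""
    n = len(languages)
    result = set()
    for tildes in free_sublists(T):
        erased = {k for (i, f) in tildes for k in range(i, f + 1)}
        # A component covered by a tilde is forced to the empty word.
        choices = [[""] if k in erased else sorted(languages[k - 1])
                   for k in range(1, n + 1)]
        for split in product(*choices):
            if all(bar_satisfied(b, split, tildes) for b in B):
                result.add("".join(split))
    return result
```

With the languages ({a}, {b}), the tildes T = ((1, 1), (2, 2)) and the bar B = ((1, 2)), the
enumeration keeps ab, a and b but rejects the empty word, since erasing both components
violates the bar (1, 2).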


Let us notice that Definition 6 is equivalent to the definition of the language of a multi- 
tilde-bar introduced in [7]. A complete proof of this equivalence is provided in [20]. 

3.1 A new family of operators


Definition 7 An Extended to multi-tilde-bar Regular Expression (EmtbRE) over an alphabet
X is inductively defined by:

E = Ø,
E = a, with a ∈ X,
E = (F + G), with F and G two EmtbREs,
E = (F · G), with F and G two EmtbREs,
E = (F*), with F an EmtbRE,
E = ~_{T;B}(E_{1,n}), with T and B two disjoint lists in S_n and E_{1,n} a list of EmtbREs.


Let us remark that the expression ε is not an EmtbRE. However it holds that L(ε) = L(Ø̃),
where Ø̃ is the tilde of Ø. Moreover, another property of simple regular expressions is not
shared by EmtbREs: it is folk knowledge that any simple regular expression can be transformed
into an equivalent trim one. This transformation is not so trivial for EmtbREs. This could be
problematic for the computation of Glushkov functions. However, this problem can be solved by
considering the symbol Ø as a symbol in a particular alphabet. This can be achieved by
extending the notion of linearity to Ø-linearity as follows:


Definition 8 Let E be an EmtbRE. The expression E is Ø-linear if it is linear and it has no 
occurrence of the symbol Ø. 


Definition 9 The Ø-linearized expression E^Ø of an EmtbRE E is computed by independently
indexing by two distinct alphabets the occurrences of symbols and the occurrences of Ø, then
by replacing them in E by their indices.


Example 1 The Ø-linearized expression of the EmtbRE E = (a · b · a + a · Ø) · b · b · Ø is the
expression E^Ø = (1 · 2 · 3 + 4 · I) · 5 · 6 · II, with the position sets
Pos(E) = {1, 2, 3, 4, 5, 6} and Pos_Ø(E) = {I, II}.
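
The indexing of Example 1 can be reproduced with a few lines of Python. This is a sketch
under simplifying assumptions: the expression is given as a plain string, every alphabetic
character other than Ø is a symbol, and the names to_roman and empty_linearize are
hypothetical.

```python
def to_roman(k):
    """Roman numeral of a small positive integer (enough for short examples)."""
    out = ""
    for value, glyph in [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]:
        while k >= value:
            out += glyph
            k -= value
    return out


def empty_linearize(expr, empty="Ø"):
    """Index symbols by integers and occurrences of `empty` by Roman numerals.

    Returns the linearized expression and the morphism h as a dictionary
    mapping each index to the character it replaces.
    """
    out, h = [], {}
    n_symbols = n_empty = 0
    for ch in expr:
        if ch == empty:
            n_empty += 1
            index = to_roman(n_empty)
        elif ch.isalpha():
            n_symbols += 1
            index = str(n_symbols)
        else:                      # operators and parentheses are copied unchanged
            out.append(ch)
            continue
        h[index] = ch
        out.append(index)
    return "".join(out), h
```

Applied to the string "(a.b.a+a.Ø).b.b.Ø", the function returns "(1.2.3+4.I).5.6.II"
together with a dictionary that sends 1, 3 and 4 to a, 2, 5 and 6 to b, and I and II to Ø.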


The associated morphism is h_Ø : (Pos(E) ∪ Pos_Ø(E))* → (X_E ∪ {Ø})* where Ø represents
the character associated to the empty set. Notice that L(E^Ø) ≠ Ø and that every Ø-linearized
expression is Ø-linear.


Definition 10 Let O be a set of operators. The set O fits with the Ø-linearization if for every
expression E using these operators, considering E^Ø the Ø-linearized expression of E, it holds:


L(E) = h_Ø(L(E^Ø) ∩ (Pos(E))*).


The two following properties are satisfied: (I) the set of operators of an EmtbRE fits with
the Ø-linearization; (II) the language of an EmtbRE is a regular one. Property (I) is proved
by the next lemma. Property (II) is proved in Sect. 4 where an EmtbRE is converted into an
equivalent NFA.

Simple regular operators fit with the Ø-linearization, as well as multi-bar and multi-tilde 
operators and multi-tilde-bar operators, that are a combination of multi-tilde and multi-bar 
operators. 


Lemma 1 Let O be the set of operators defined by:


O = {+, ·, *} ∪ ⋃ {~_{T;B} | ∃n ∈ N, (T, B ∈ S_n) ∧ (T ∩ B = Ø)}.


The set O fits with the Ø-linearization.

Proof Let E be an expression over the operators in O, and let E^Ø be its Ø-linearized
expression. We extend the induction on the structure of E. Let E = ~_{T;B}(E_{1,n}) and
E^Ø = ~_{T;B}(E'_{1,n}).


Let w be a word in L(E). According to Definition 6, there exists a free sublist T' of T
and s = (w1, ..., wn) a split up of w over (L(E1) ∪ {ε}, ..., L(En) ∪ {ε}) such that T'
generates s and such that s satisfies every bar in B w.r.t. T'. Hence, according to Definition 2,
for every integer k in [1, n], wk = ε if there exists (i, f) in T' such that k ∈ [i, f],
wk ∈ L(Ek) otherwise. According to the induction hypothesis, for every integer k in [1, n],
there exists a word w'_k satisfying h_Ø(w'_k) = wk, and w'_k = ε if there exists (i, f) in T'
such that k ∈ [i, f], w'_k ∈ L(E'_k) otherwise. Consequently, T' generates the sequence
s' = (w'_1, ..., w'_n), and since s satisfies every bar in B w.r.t. T', so does s'. According to
Definition 6, the word w' = w'_1 ··· w'_n is in L(E^Ø) and since by construction h_Ø(w') = w,
w = w1 ··· wn is in h_Ø(L(E^Ø) ∩ Pos(E)*).

Let w be a word in h_Ø(L(E^Ø) ∩ Pos(E)*). Then there exists a word w' in L(E^Ø) such
that h_Ø(w') = w. According to Definition 6, there exists a free sublist T' of T and
s' = (w'_1, ..., w'_n) a split up of w' over the languages (L(E'_1) ∪ {ε}, ..., L(E'_n) ∪ {ε})
such that T' generates s' and such that s' satisfies every bar in B w.r.t. T'. Hence, according
to Definition 2, for every integer k in [1, n], w'_k = ε if there exists a couple (i, f) in T'
such that k ∈ [i, f], w'_k ∈ L(E'_k) otherwise. Let us consider the sequence s = (w1, ..., wn)
such that wk = h_Ø(w'_k) for every integer k in [1, n]. By construction, w = w1 ··· wn.
According to the induction hypothesis, the languages L(Ek) and h_Ø(L(E'_k) ∩ Pos(Ek)*) are
equal. Therefore, for every integer k in [1, n], wk = ε if there exists a couple (i, f) in T'
such that k ∈ [i, f], wk ∈ L(Ek) otherwise. Consequently, T' generates s and since s' satisfies
every bar in B w.r.t. T', so does s. Finally, according to Definition 6, w ∈ L(E).


3.2 The saturated form 


In this subsection, we define a particular syntactical form for EmtbREs, the saturated form,
that will be useful to shorten the computation of Glushkov functions of the following section.
We show that any EmtbRE E admits an equivalent EmtbRE E' in saturated form.


Definition 11 A multi-tilde-bar operator ~_{T;B} is said to be saturated if:
T ∪ B = [1, n]²_≤.


Definition 12 An EmtbRE E is said to be in saturated form if every multi-tilde-bar operator 
appearing in E is saturated. 


Two distinct Ø-linear multi-tilde-bars in saturated form applied to a same list of expressions
denote two different languages.


Lemma 2 Let E = ~_{T;B}(E_{1,n}) be an Ø-linear and saturated EmtbRE. Let s = (w1, ..., wn)
be the unique split up of a word w over (L(E1) ∪ {ε}, ..., L(En) ∪ {ε}). Then the two following
conditions are equivalent:


(1) w is in L(E),
(2) for every ε-maximal factor wi ··· wf of w, the couple (i, f) is in T.


Proof (1 ⇒ 2) Suppose that the word wi ··· wf is an ε-maximal factor of w such that the
couple (i, f) is not in T. Since E is in saturated form, the couple (i, f) belongs to B. Let us
show that for every free sublist T' of T generating s, (i, f) is not satisfied by s. Let T' be a
free sublist of T that generates s. Since wi ··· wf is an ε-maximal factor of w, T' contains
no overhanging tilde nor tilde including (i, f). Moreover, since wi ··· wf = ε, according to
Definition 5, (i, f) is not satisfied by s w.r.t. T'. According to Definition 6, w ∉ L(E).

(1 ⇐ 2) Suppose that for every ε-maximal factor wi ··· wf of w, the couple (i, f) is in
T. Let us consider the list T' defined by


T' = ((i, f) ∈ T | wi ··· wf is an ε-maximal factor in w).


By construction, s is generated by T'. Let us show that s satisfies every bar in B w.r.t. T'. Let
b = (i_b, f_b) be a bar in B. If for every tilde t in T', Range(t) ∩ Range(b) = Ø, by definition
of s, w_{i_b} ··· w_{f_b} ≠ ε and b is satisfied by s w.r.t. T'. Otherwise, let t be a tilde in T'
such that Range(t) ∩ Range(b) ≠ Ø. If t overlaps b or if t includes b, b is satisfied by s w.r.t.
T'. If b includes t = (i_t, f_t), since w_{i_t} ··· w_{f_t} is an ε-maximal factor, either
w_{i_t − 1} ≠ ε, or w_{f_t + 1} ≠ ε. Hence, w_{i_b} ··· w_{f_b} ≠ ε and consequently, s satisfies b
w.r.t. T'. Therefore, T' is a sublist of T that generates s and such that s satisfies every bar
in B w.r.t. T'. According to Definition 6, w ∈ L(E). □
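
Condition (2) of Lemma 2 gives a direct membership test once the split up is known. The
following Python sketch illustrates it; split ups are encoded as tuples of strings with the
empty string standing for ε, and the function names are assumptions made for this sketch.

```python
def eps_maximal_factors(split):
    """Couples (i, f), 1-indexed, of the maximal runs of empty words in `split`."""
    factors, start = [], None
    for k, word in enumerate(split, start=1):
        if word == "" and start is None:
            start = k                       # a new run of empty words begins
        elif word != "" and start is not None:
            factors.append((start, k - 1))  # the run ended just before position k
            start = None
    if start is not None:
        factors.append((start, len(split)))
    return factors


def in_saturated_language(split, T):
    """Condition (2) of Lemma 2: every maximal run of empty words is a tilde of T."""
    tildes = set(T)
    return all(factor in tildes for factor in eps_maximal_factors(split))
```

For the split up (a, ε, ε, b, ε) the ε-maximal factors are (2, 3) and (5, 5), so the
corresponding word belongs to the language of a saturated operator exactly when both couples
are tildes.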


Proposition 1 Let ~_{T1;B1} and ~_{T2;B2} be two saturated multi-tilde-bar operators and
let E1 = ~_{T1;B1}(F_{1,n}) and E2 = ~_{T2;B2}(F_{1,n}) be two Ø-linear EmtbREs. The two
following conditions are equivalent:


(1) B1 = B2 and T1 = T2,
(2) L(E1) = L(E2).


Proof (1 ⇒ 2) Trivial.

(1 ⇐ 2) Since both of the operators are saturated, B1 ≠ B2 if and only if T1 ≠ T2. Let
t = (i, f) be a tilde in the symmetric difference of T1 and T2. Without loss of generality,
let us suppose that t is in T1. Let us consider the split up s = (w1, ..., wn) of a word w
such that wi ··· wf = ε and such that for every integer k not in [i, f], wk is in L(F̄k) (notice
that L(F̄k) ≠ Ø since E1 and E2 are both Ø-linear). According to Lemma 2, w is in L(E1)
whereas w is not in L(E2). As a consequence, L(E1) ≠ L(E2).


Let us show now that every EmtbRE admits an equivalent EmtbRE in saturated form. For
this purpose, we define the notion of (i, f)-nullability.


Definition 13 Let E = ~_{T;B}(E_{1,n}) be an Ø-linear EmtbRE, E'_{1,n} be a list of Ø-linear
EmtbREs such that for every integer k in [1, n], E'_k = Ēk, and (i, f) be a couple in [1, n]²_≤.
Let us consider the language L' = L(E'_{1···i−1} · E'_{f+1···n}).

The EmtbRE E is said to be (i, f)-nullable if L' ⊆ L(E).


Lemma 3 Let E = ~_{T;B}(E_{1,n}) be an Ø-linear EmtbRE, E'_{1,n} a list of Ø-linear
EmtbREs such that for every integer k in [1, n], E'_k = Ēk, and (i, f) be a couple in [1, n]²_≤.
Let us consider the language L' = L(E'_{1···i−1} · E'_{f+1···n}). Let w be a word in L'. Then the
two following conditions are equivalent:


(1) L' ⊆ L(E),
(2) w ∈ L(E).


Proof (1 ⇒ 2) Trivial.

(1 ⇐ 2) Suppose that w is in L(E). Let us show that every word in L' is in L(E). Let
w' be a word in L'. Since w ∈ L(E), according to Definition 6, there exists a free sublist
T' of T and a split up s = (w1, ..., wn) of w over (L(E1) ∪ {ε}, ..., L(En) ∪ {ε}) such
that T' generates s and s satisfies every bar in B w.r.t. T'. Hence, for every integer k in [1, n],
wk = ε if there exists a tilde (i, f) in T' such that k ∈ [i, f], wk ∈ L(Ek) otherwise. Since
wi ··· wf is the only ε-maximal factor in w, and since, E being Ø-linear, w and w' admit a
unique split up w.r.t. (L(E'_1), ..., L(E'_{i−1}), L(E'_{f+1}), ..., L(E'_n)), w' can be split up
into s' = (w'_1, ..., w'_n) in such a way that for every integer k in [1, n], w'_k = ε if there
exists a tilde (i, f) in T' such that k ∈ [i, f], w'_k ∈ L(Ek) otherwise. Hence, s' is generated
by T'. Let b = (i_b, f_b) be a bar in B. Since s satisfies b, either T' contains a tilde t that
overlaps or includes b, or w_{i_b} ··· w_{f_b} ≠ ε. By construction of s', the only ε-maximal
factor in w' is w'_i ··· w'_f. Moreover, by definition of s, the only ε-maximal factor in w is
wi ··· wf. Therefore, if w_{i_b} ··· w_{f_b} ≠ ε then w'_{i_b} ··· w'_{f_b} ≠ ε. Consequently, s'
satisfies b w.r.t. T'. Finally, T' generates s' and s' satisfies b w.r.t. T' for every bar b in B,
and according to Definition 6, w' ∈ L(E). □


Notice that for an EmtbRE E = ~_{T;B}(E_{1,n}), if a couple (i, f) is in T, then the EmtbRE
E is (i, f)-nullable. Conversely, if a couple (i, f) is in B, then E is not (i, f)-nullable. The
saturated form allows us to set the reciprocal part of this proposition: if the EmtbRE E is
(i, f)-nullable, then (i, f) ∈ T and if E is not (i, f)-nullable, then (i, f) ∈ B. We now
show that any EmtbRE E admits an equivalent EmtbRE E' in saturated form.


Proposition 2 Let E be an EmtbRE. Then there exists an EmtbRE E' in saturated form such 
that L(E') = L(E). 


We first show that any multi-tilde-bar operator can be replaced by a saturated one, its 
completed, that can be computed w.r.t. the list of expressions on which it applies. 


Definition 14 Let E = ~_{T;B}(E_{1,n}) be an Ø-linear EmtbRE. Consider the lists T' and B'
defined by:


T' = {(i, f) ∈ [1, n]²_≤ | E is (i, f)-nullable},
B' = {(i, f) ∈ [1, n]²_≤ | E is not (i, f)-nullable}.


The operator ~_{T';B'} is said to be the saturated operator associated to the expression
~_{T;B}(E_{1,n}).

Obviously, the saturated operator associated to an operator is saturated. Consider two
Ø-linear EmtbREs E' = ~_{T';B'}(E_{1,n}) and E = ~_{T;B}(E_{1,n}) where ~_{T';B'} is the
saturated operator associated to ~_{T;B}(E_{1,n}). We now show that the denoted language is
preserved during saturation.

Lemma 4 Let E = ~_{T;B}(E_{1,n}) and E' = ~_{T';B'}(E_{1,n}) be two Ø-linear EmtbREs
where ~_{T';B'} is the saturated operator associated to ~_{T;B}(E_{1,n}). Then the two following
conditions are satisfied:


(1) T ⊆ T',
(2) B ⊆ B'.

Proof If (i, f) is a couple in T, then E is (i, f)-nullable; hence, according to Definition 14,
(i, f) ∈ T'. If (i, f) is a couple in B, then E is not (i, f)-nullable and according to
Definition 14, (i, f) ∈ B'. □


Lemma 5 Let E = ~_{T;B}(E_{1,n}) be an Ø-linear EmtbRE and s = (w1, ..., wn) be the
unique split up of a word w over (L(E1) ∪ {ε}, ..., L(En) ∪ {ε}). Then the two following
conditions are equivalent:


(1) w ∈ L(E),
(2) for every ε-maximal factor wi ··· wf in w, E is (i, f)-nullable.

Proof (1 ⇒ 2) Suppose that w ∈ L(E). Then there exists a sublist T' in T such that T'
generates s and s satisfies every bar in B w.r.t. T'. The sequence s is such that for every integer
k in [1, n], wk = ε if there exists a couple (i, f) in T' such that k ∈ [i, f], wk ∈ L(Ek)
otherwise. Let (i, f) be a couple in [1, n]²_≤ such that wi ··· wf is an ε-maximal factor in w.
Let s' = (w'_1, ..., w'_n) be the unique split up of a word w' such that for every integer k in
[1, n], w'_k = ε if k ∈ [i, f], w'_k ∈ L(Ēk) otherwise. Let us consider the list
U = T' ∩ [i, f]²_≤. Hence, the sequence s' is such that for every integer k in [1, n], w'_k = ε
if there exists a couple (i', f') in U such that k ∈ [i', f'], w'_k ∈ L(Ēk) otherwise.
Consequently, U generates s'. Let b = (i_b, f_b) be a bar in B. If b overlaps or includes (i, f),
or if Range(b) ∩ [i, f] = Ø, by definition of s', w'_{i_b} ··· w'_{f_b} ≠ ε and s' satisfies b
w.r.t. U. Let us suppose that b is included in (i, f). Since s satisfies every bar in B w.r.t. T',
s satisfies b w.r.t. T'. By definition, it holds:


wi ··· w_{i_b} ··· w_{f_b} ··· wf = w'_i ··· w'_{i_b} ··· w'_{f_b} ··· w'_f = ε.


Therefore, there exists a tilde t in T' such that t overlaps or includes b. By construction of
U, t ∈ U. Consequently, s' satisfies b w.r.t. U. According to Definition 6, w' ∈ L(E) and
finally, according to Definition 13, E is (i, f)-nullable.

(1 ⇐ 2) Let us suppose that for every ε-maximal factor wi ··· wf of w, E is (i, f)-nullable.
Let (i, f) be a couple such that E is (i, f)-nullable. For every integer k in [1, n],
we set E'_k = Ēk and L' = L(E'_{1···i−1} · E'_{f+1···n}). According to Definition 13 and Lemma 3,
there exists a word w' in L' such that w' ∈ L(E). According to Definition 6, there exists a
free sublist T'_{(i,f)} of T and a split up s' = (w'_1, ..., w'_n) of w' such that T'_{(i,f)}
generates s' and such that s' satisfies every bar in B w.r.t. T'_{(i,f)}. Let us consider the list
T' defined as the union of these lists:


T' = ⋃ {T'_{(i,f)} | (i, f) ∈ [1, n]²_≤ ∧ wi ··· wf is an ε-maximal factor in w}.


Let us show that T' generates s and that s satisfies every bar in B w.r.t. T'. By construction
of T', T' generates s, since each of its ε-maximal factors wi ··· wf is generated by T'_{(i,f)}.

Let b = (i_b, f_b) be a bar in B. Let us suppose that s does not satisfy b w.r.t. T'. Hence,
T' contains no overlapping tildes nor tildes including b, and w_{i_b} ··· w_{f_b} = ε. Either the
factor w_{i_b} ··· w_{f_b} is ε-maximal and we set (i_b, f_b) = (i, f), or it is included into
another ε-maximal factor wi ··· w_{i_b} ··· w_{f_b} ··· wf. Since wi ··· wf is ε-maximal, then E is
(i, f)-nullable. Consequently, since there exists a word w' in L' such that w' ∈ L(E), such that
T'_{(i,f)} generates w' and such that w' satisfies every bar in B w.r.t. T'_{(i,f)}, including the
bar b, either there exists a tilde t in T'_{(i,f)} overlapping or including b (contradiction
since t ∈ T'_{(i,f)} ⊆ T'), or w'_{i_b} ··· w'_{f_b} ≠ ε (contradiction with w' ∈ L'). Hence, s
satisfies every bar in B w.r.t. T'.

Finally, according to Definition 6, w ∈ L(E).


Proposition 3 Let E = ~_{T;B}(E_{1,n}) and E' = ~_{T';B'}(E_{1,n}) be two Ø-linear
EmtbREs where ~_{T';B'} is the saturated operator associated to ~_{T;B}(E_{1,n}). Then it
holds:


L(E) = L(E').


Proof Let w be a word in L(E). According to Definition 6, there exists a free sublist T'' of
T and a split up s = (w1, ..., wn) of w over (L(E1) ∪ {ε}, ..., L(En) ∪ {ε}) such that T''
generates s and s satisfies every bar in B w.r.t. T''. For every integer k in [1, n], wk = ε if
there exists a couple (i, f) in T'' such that k ∈ [i, f], wk ∈ L(Ek) otherwise. According to
Lemma 5, for every ε-maximal factor wi ··· wf in w, E is (i, f)-nullable. By construction
of T', (i, f) ∈ T'. According to Lemma 2, w is in L(E').

Let w be a word in L(E'). According to Definition 6, there exists a free sublist T'' of T'
and a split up s = (w1, ..., wn) of w over (L(E1) ∪ {ε}, ..., L(En) ∪ {ε}) such that T''
generates s and s satisfies every bar in B' w.r.t. T''. For every integer k in [1, n], wk = ε if
there exists a couple (i, f) in T'' such that k ∈ [i, f], wk ∈ L(Ek) otherwise. According to
Lemma 2, for every ε-maximal factor wi ··· wf in w, (i, f) ∈ T'. By construction of T', E
is (i, f)-nullable. Finally, according to Lemma 5, w ∈ L(E).


Proof (Proposition 2)
Let E^Ø be the Ø-linearized EmtbRE of E. According to Lemma 1, we have
h_Ø(L(E^Ø) ∩ Pos(E^Ø)*) = L(E). For every multi-tilde-bar subexpression F = ~_{T;B}(F_{1,n})
of E^Ø, according to Proposition 3, the expression F' = ~_{T';B'}(F_{1,n}) with ~_{T';B'} the
saturated operator associated to ~_{T;B}(F_{1,n}) denotes the language L(F). Let G be the
expression obtained by substituting every multi-tilde-bar subexpression F in E^Ø by the
corresponding expression F' computed before. Therefore, the EmtbRE G is in saturated form and
denotes L(E^Ø). Let E' = h_Ø(G). According to Lemma 1, E' is equivalent to E and by
construction, E' is in saturated form.


4 Glushkov functions for EmtbREs 


In this section, we extend the computation of Glushkov functions to EmtbREs in saturated 
form. 


Definition 15 Let E = ~_{T;B}(E_{1,n}) be an Ø-linear EmtbRE in saturated form, l be an
integer in [1, n] and x be an element in Pos(E_l). The Glushkov functions of the EmtbRE E
are:


(15.1) Pos(E) = ⋃_{k=1}^{n} Pos(Ek),
(15.2) Null(E) = {ε} if (1, n) ∈ T, Ø otherwise,
(15.3) First(E) = First(E1) ∪ {x ∈ First(Ek) | k ∈ [2, n] ∧ (1, k − 1) ∈ T},
(15.4) Last(E) = Last(En) ∪ {x ∈ Last(Ek) | k ∈ [1, n − 1] ∧ (k + 1, n) ∈ T},
(15.5) Follow(x, E) = Follow(x, E_l) if l = n ∨ x ∉ Last(E_l),
       Follow(x, E) = Follow(x, E_l) ∪ First(E_{l+1})
                      ∪ {x' ∈ First(E_{l'}) | l' ∈ [l + 2, n] ∧ (l + 1, l' − 1) ∈ T} otherwise.
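
The five formulas of Definition 15 translate almost literally into code. The following Python
sketch is an illustration under stated assumptions: the Glushkov functions of each
subexpression are supplied in a small record, only the list T is passed since B is its
complement in saturated form, and the names Glushkov, symbol, multi_tilde_bar and words are
hypothetical. The helper words enumerates the accepted position sequences and terminates only
when the Follow relation is acyclic.

```python
from dataclasses import dataclass


@dataclass
class Glushkov:
    pos: set
    null: bool
    first: set
    last: set
    follow: dict  # position -> set of positions


def symbol(p):
    """Glushkov functions of the one-position expression p."""
    return Glushkov({p}, False, {p}, {p}, {p: set()})


def multi_tilde_bar(subs, T):
    """Formulas (15.1)-(15.5) for a saturated operator with tilde list T."""
    n, T = len(subs), set(T)
    pos = set().union(*(s.pos for s in subs))                      # (15.1)
    null = (1, n) in T                                             # (15.2)
    first = set(subs[0].first)                                     # (15.3)
    for k in range(2, n + 1):
        if (1, k - 1) in T:
            first |= subs[k - 1].first
    last = set(subs[n - 1].last)                                   # (15.4)
    for k in range(1, n):
        if (k + 1, n) in T:
            last |= subs[k - 1].last
    follow = {}                                                    # (15.5)
    for l in range(1, n + 1):
        for x in subs[l - 1].pos:
            fol = set(subs[l - 1].follow[x])
            if l < n and x in subs[l - 1].last:
                fol |= subs[l].first
                for l2 in range(l + 2, n + 1):
                    if (l + 1, l2 - 1) in T:
                        fol |= subs[l2 - 1].first
            follow[x] = fol
    return Glushkov(pos, null, first, last, follow)


def words(g):
    """Position sequences accepted by the position automaton (acyclic case only)."""
    out = {()} if g.null else set()

    def walk(x, prefix):
        prefix = prefix + (x,)
        if x in g.last:
            out.add(prefix)
        for y in g.follow[x]:
            walk(y, prefix)

    for x in g.first:
        walk(x, ())
    return out
```

For three one-position subexpressions 1, 2, 3 and T = ((1, 1), (2, 2)), the sketch gives
First = {1, 2}, Last = {3}, Follow(1) = {2, 3} and Follow(2) = {3}, and the accepted position
sequences are 1 2 3, 1 3 and 2 3, which are exactly the words whose ε-maximal factors are
(1, 1) or (2, 2).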


Proposition 4 Let E = ~_{T;B}(E_{1,n}) be an Ø-linear EmtbRE in saturated form deduced
from an EmtbRE E' by Ø-linearization. Then it holds:


L(E) = L(G(E)).

Proof The definitions of the Glushkov functions are:


(P1) x ∈ X_E ⇔ x ∈ Pos(E),
(P2) Null(E) = {ε} if ε ∈ L(E), Ø otherwise,
(P3) ∃w = x · w' ∈ L(E) ⇔ x ∈ First(E),
(P4) ∃w = w' · x ∈ L(E) ⇔ x ∈ Last(E),
(P5) ∃w = w' · x · x' · w'' ∈ L(E) ⇔ x' ∈ Follow(x, E).

Let us show that Properties (P1)-(P5) are still satisfied by the Glushkov functions of a
multi-tilde-bar.

(P1) Pos: x ∈ X_E ⇔ ∃k ∈ [1, n] | x ∈ X_{E_k} ⇔ x ∈ Pos(E).

(P2) Null: ε ∈ L(E) ⇔ E is (1, n)-nullable ⇔ (1, n) ∈ T.

(P3) First: (⇒) Let w = x · w' be a word in L(E). According to Definition 6, w can be
split up into s = (w1, ..., wn) such that for every integer k in [1, n], wk is in L(Ek) ∪ {ε}.
Since E is Ø-linear, there exists a unique integer l in [1, n] such that w_l = x · w'_l ∈ L(E_l).
By definition, x ∈ First(E_l). If l = 1, according to Definition 15.3, First(E_l) ⊆ First(E).
If l ≠ 1, w1 ··· w_{l−1} is an ε-maximal factor. According to Lemma 2, (1, l − 1) is in T.
According to Definition 15.3, First(E_l) ⊆ First(E). Consequently, x is in First(E).

(⇐) Let x be an element in First(E). Then there exists an integer l in [1, n] such that
x is in First(E_l) and either l = 1, or l ≠ 1 and (1, l − 1) is in T. By definition, there
exists a word w_l = x · w'_l in L(E_l). Let us consider the split up s = (w1 = ε, ..., w_{l−1} =
ε, w_l, w_{l+1}, ..., wn) such that for every integer k in [l + 1, n], wk ∈ L(Ēk). If l = 1,
w = w_l · w_{l+1} ··· wn has no ε-maximal factor by construction, and according to Lemma 2, w is
in L(E). If l ≠ 1, the word w1 ··· w_{l−1} is the only ε-maximal factor in w. Since (1, l − 1)
is in T, according to Lemma 2, w is in L(E).

(P4) Last: Similar to (P3), reasoning on the last symbols of the words.
(P5) Follow: Let l be the integer in [1, n] such that x ∈ Pos(E_l).

(⇒) Let w = w' · x · x' · w'' be a word in L(E). According to Definition 6, w can be
split up into s = (w1, ..., wn) such that for every integer k in [1, n], wk is in L(Ek) ∪ {ε}. Let
l' be the integer in [1, n] such that x' ∈ Pos(E_{l'}). Three cases can occur. (a) If l = l', then
w_l = w'_l · x · x' · w''_l. Since w_l is in L(E_l), by definition, x' is in Follow(x, E_l).
According to Definition 15.5, x' is in Follow(x, E). (b) If l' = l + 1, w_l = w'_l · x and
w_{l'} = x' · w'_{l'}. By definition, x is in Last(E_l) and x' is in First(E_{l'}). According to
Definition 15.5, x' is in Follow(x, E). (c) Let us suppose that l ≠ l' and l' ≥ l + 2. Then
w_l = w'_l · x, w_{l'} = x' · w'_{l'} and w_{l+1} ··· w_{l'−1} is an ε-maximal factor in w.
According to Lemma 2, (l + 1, l' − 1) is in T. According to Definition 15.5, x' is in
Follow(x, E).

(⇐) Let x' be an element in Follow(x, E). According to Definition 15.5, three cases can
occur. (a) Suppose that x' is in Follow(x, E_l). By hypothesis, there exists a word
w_l = w'_l · x · x' · w''_l in L(E_l). Let s = (w1, ..., w_l, ..., wn) be a split up of a word
w = w1 ··· w_l ··· wn such that for every integer k in [1, n], wk ∈ L(Ēk). The word w has no
ε-maximal factor and according to Lemma 2, w is in L(E). (b) Let us suppose that x is in
Last(E_l) and x' is in First(E_{l+1}). By hypothesis, there exist w_l = w'_l · x in L(E_l) and
w_{l+1} = x' · w'_{l+1} in L(E_{l+1}). Let us consider the word w defined by the split up
s = (w1, ..., w_l, w_{l+1}, ..., wn) such that for every integer k in [1, n], wk ∈ L(Ēk). The
word w has no ε-maximal factor and according to Lemma 2, w is in L(E). (c) Suppose that x is
in Last(E_l) and that there exists an integer l' in [l + 2, n] such that x' is in First(E_{l'})
and (l + 1, l' − 1) is in T. By hypothesis, there exist w_l = w'_l · x in L(E_l) and
w_{l'} = x' · w'_{l'} in L(E_{l'}). Consider the word w defined by the sequence
s = (w1, ..., w_l, w_{l+1} = ε, ..., w_{l'−1} = ε, w_{l'}, ..., wn) such that for every integer k
in [1, l − 1] ∪ [l' + 1, n], wk ∈ L(Ēk). The factor w_{l+1} ··· w_{l'−1} is the only ε-maximal
factor in w, and by hypothesis, (l + 1, l' − 1) is in T. According to Lemma 2,
w isin L(E). 

Oo 


Example 2 Let $E$ be the EmtbRE in saturated form obtained by applying tildes and bars to $(a + b) \cdot a^*$, and let
$E'$ be the $\emptyset$-linearized EmtbRE of $E$, obtained in the same way from $(1 + 2) \cdot 3^*$. The Glushkov functions for the
EmtbRE $E$ are given in Table 1. The Glushkov automaton of $E$ is given in Fig. 1.

Table 1 Glushkov functions for the EmtbRE E

$\mathrm{Pos}(E) = \{1, 2, 3\}$
$\mathrm{Null}(E) = \emptyset$
$\mathrm{First}(E) = \{1, 2, 3\}$
$\mathrm{Last}(E) = \{1, 2, 3\}$
$\mathrm{Follow}(1, E) = \{3\}$
$\mathrm{Follow}(2, E) = \{3\}$
$\mathrm{Follow}(3, E) = \{3\}$

Fig. 1 The Glushkov automaton of the EmtbRE E

5 From an extended Glushkov automaton to an EmtbRE 


In this section, we give a characterization of the family of the extended Glushkov automata 
and we show how to convert an extended (n+ 1)-state Glushkov automaton into an equivalent 
EmtbRE with a width equal to n. 


Definition 16 An NFA A is an extended Glushkov automaton if there exists an EmtbRE E 
the Glushkov automaton of which is isomorphic to A. 


An automaton is said to be standard if it has a unique initial state $i$ and if $(Q \times \Sigma \times \{i\}) \cap \delta = \emptyset$.
An automaton is said to be homogeneous if for all transitions $(p, a, q)$, $(p', a', q')$ in $\delta$,
$q = q' \Rightarrow a = a'$.
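The two definitions above can be checked mechanically. The following is a minimal sketch, assuming an automaton is given as a set of initial states and a set of transition triples (source, symbol, target); the function names and the sample automaton are illustrative and do not come from the paper.

```python
from typing import Hashable, Iterable, Set, Tuple

Transition = Tuple[Hashable, str, Hashable]  # (source, symbol, target)


def is_standard(initials: Set[Hashable], delta: Iterable[Transition]) -> bool:
    """True if there is a unique initial state i and no transition enters i."""
    if len(initials) != 1:
        return False
    (i,) = initials
    return all(q != i for (_, _, q) in delta)


def is_homogeneous(delta: Iterable[Transition]) -> bool:
    """True if all transitions entering a given state carry the same symbol."""
    label = {}
    for (_, a, q) in delta:
        # setdefault records the first symbol seen for q and returns it afterwards
        if label.setdefault(q, a) != a:
            return False
    return True


# Hypothetical automaton: state 0 is initial, states 1, 2, 3 are labelled a, b, a.
delta = {(0, "a", 1), (0, "b", 2), (0, "a", 3),
         (1, "a", 3), (2, "a", 3), (3, "a", 3)}
standard = is_standard({0}, delta)
homogeneous = is_homogeneous(delta)
# Adding a transition labelled b into state 3 breaks homogeneity.
broken = is_homogeneous(delta | {(1, "b", 3)})
```

In a homogeneous automaton each state can be identified with the symbol labelling its incoming transitions, which is what allows states to be read as positions of an expression.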

We first show that any standard and homogeneous (n + 1)-state acyclic automaton is 
the Glushkov automaton of an n-width EmtbRE. As a corollary, this result provides an 
alternative proof of the fact that any acyclic n-state automaton admits an equivalent O (n)- 
width EmtbRE [7]. Finally, we extend the characterization of the family of the Glushkov 
automata [9] to the case of non-acyclic automata. 


5.1 Acyclic case 


Proposition 5 Any standard and homogeneous acyclic automaton is an extended Glushkov 
automaton. 


Proof Let $A = (\Sigma, Q, I, F, \delta)$ be a standard and homogeneous $(n+1)$-state acyclic automaton.
Let $\tau : [0, n] \to Q$ be a topological sort over the graph of $A$. The state $\tau(k)$ will be
denoted by $\tau_k$. The automaton $A$ being homogeneous, we denote by $a_k$ the symbol in $\Sigma$ that
labels the state $\tau_k$. Let us consider the acyclic automaton $A' = (\Sigma \cup \{\emptyset\}, Q \cup Q', I, F, \delta \cup \delta')$
where $Q' = \{\emptyset_k \mid k \in [0, n-1] \wedge (\{\tau_k\} \times \Sigma \times \{\tau_{k+1}\}) \cap \delta = \emptyset\}$ and
$\delta' = \{(\tau_k, \emptyset, \emptyset_k) \mid \emptyset_k \in Q'\} \cup \{(\emptyset_k, a_{k+1}, \tau_{k+1}) \mid \emptyset_k \in Q'\}$.
By construction, the languages $L(A)$ and $L(A')$ are equal. Let $n'$
be the number of states of $A'$. By construction, the topological sort $\tau'$ of the graph of $A'$ is
unique. Since $A'$ is homogeneous, we denote by $a'_k$ the symbol in $\Sigma \cup \{\emptyset\}$ that labels the state
$\tau'_k$. Let $\mathcal{E}$ be the list of expressions $(a'_1, a'_2, \ldots, a'_{n'})$. The two lists $T'$ and $B'$ are deduced
from $A'$ as follows:
pa [GNela is Aitl 
ACT b Arai Tr Ed USJ)VGAfAf=NAT € F)) 


B' = [Ln] \ T UIG, f) e Fle I f =i +) 


Let us show that $A$ is isomorphic to the Glushkov automaton of the EmtbRE
$E' = \simeq_{T';B'}(\mathcal{E})$. Let $E'' = \simeq_{T';B'}(\mathcal{E}'')$ be the $\emptyset$-linearized expression of $E'$ and $\Sigma''$
be the alphabet of $E''$. Since the list of expressions of $E''$ is reduced to a list of symbols, the
Glushkov functions of $E''$ are easy to compute according to Definition 15:


(1) By definition, $\mathrm{Pos}(E'') = \Sigma''$.

(2) $\mathrm{Null}(E'') = \{\varepsilon\}$ if $(1, n') \in T'$, $\emptyset$ otherwise.

(3) $\mathrm{First}(E'') = \{k \mid (1, k-1) \in T' \vee k = 1\}$.

(4) $\mathrm{Last}(E'') = \{k \mid (k+1, n') \in T' \vee k = n'\}$.

(5) $\mathrm{Follow}(k, E'') = \{k' \mid k' = k+1 \vee (k+1, k'-1) \in T'\}$.
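The five formulas above translate almost literally into code. The sketch below assumes the expression is a multi-tilde-bar over a list of n symbols at positions 1..n and that the set of tildes T is already given in saturated form as a set of pairs (i, f) with i <= f; the function name and the sample list are illustrative only.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def glushkov_functions(n: int, T: Set[Pair]):
    """Glushkov functions of a multi-tilde-bar over n symbols (positions 1..n).

    T is assumed to be the saturated set of tildes, each (i, f) with i <= f.
    Returns (pos, nullable, first, last, follow).
    """
    pos = set(range(1, n + 1))
    nullable = (1, n) in T
    # k is a first position if everything before it can be skipped
    first = {k for k in pos if k == 1 or (1, k - 1) in T}
    # k is a last position if everything after it can be skipped
    last = {k for k in pos if k == n or (k + 1, n) in T}
    # k2 follows k if it is the next position or the block between them can be skipped
    follow: Dict[int, Set[int]] = {
        k: {k2 for k2 in pos if k2 == k + 1 or (k + 1, k2 - 1) in T}
        for k in pos
    }
    return pos, nullable, first, last, follow


# Hypothetical list of four symbols where position 2 or position 3 may be skipped,
# but not both at once (the pair (2, 3) is not a tilde).
pos, nullable, first, last, follow = glushkov_functions(4, {(2, 2), (3, 3)})
```

Only T is needed here because, in saturated form, the bars are the complement of the tildes; a non-saturated pair of lists would first have to be saturated, which this sketch does not attempt.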


By construction, the Glushkov automaton of $E'$ is the automaton $A'$, since (a) any transition
of $A'$ either corresponds to a tilde in $T'$, or is the link between the symbols $a'_k$ and $a'_{k+1}$,
and (b) final states correspond either to a tilde ending on $n'$ or to the last state $\tau'_{n'}$. The
automaton obtained by removing each transition labelled by $\emptyset$ as well as the state that ends
such a transition is isomorphic to $A$. As a consequence, the automaton $A$ is isomorphic to
the Glushkov automaton of the expression $E'$, the width of which is equal to
$n' - \#\{a'_k \mid a'_k = \emptyset\} = n$.


As a corollary, any acyclic $(n+1)$-state automaton (not necessarily an extended Glushkov
one) can be transformed into an $O(n)$-width EmtbRE.


Lemma 6 ([7]) Let $A = (\Sigma, Q, I, F, \delta)$ be an acyclic $n$-state automaton. Then there exists
an equivalent acyclic automaton $A' = (\Sigma, Q', I', F', \delta')$ that is standard and homogeneous
and such that $(\#Q') \le (\#\Sigma) \times (\#Q) + 1$. The automaton $A'$ can be computed in time
$O((\#\Sigma) \times n^2)$.


Corollary 1 Let $A$ be an acyclic $n$-state automaton over an alphabet $\Sigma$. Then there exists
an equivalent EmtbRE with a width equal to $O((\#\Sigma) \times n)$ that can be computed with an
$O((\#\Sigma) \times n^2)$ worst time complexity.
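The state bound of Lemma 6 can be illustrated with a common textbook construction: add a fresh initial state and split every state q into copies (q, a), one per symbol a entering q. This is a sketch of that generic construction and is not claimed to be the algorithm of [7]; the function names and the sample automaton are hypothetical, and inaccessible copies are not trimmed.

```python
from itertools import product


def standardize_homogenize(initials, finals, delta):
    """Return an equivalent standard and homogeneous automaton.

    States are a fresh initial state plus pairs (q, a) for each transition (p, a, q),
    so there are at most (#Sigma) * (#Q) + 1 of them.
    """
    init = "init"  # assumed not to clash with an existing state name
    states = {init} | {(q, a) for (_, a, q) in delta}
    new_delta = set()
    for (p, a, q) in delta:
        if p in initials:
            new_delta.add((init, a, (q, a)))
        for s in states:
            if s != init and s[0] == p:  # every copy of p keeps p's outgoing moves
                new_delta.add((s, a, (q, a)))
    new_finals = {s for s in states if s != init and s[0] in finals}
    if initials & finals:
        new_finals.add(init)
    return states, {init}, new_finals, new_delta


def accepts(initials, finals, delta, word):
    """Subset simulation of an NFA without epsilon-transitions."""
    current = set(initials)
    for a in word:
        current = {q for (p, b, q) in delta if p in current and b == a}
    return bool(current & finals)


# Hypothetical acyclic automaton that is not homogeneous (state 1 is entered by a and b).
I, F = {0}, {2}
delta = {(0, "a", 1), (0, "b", 1), (1, "a", 2), (0, "a", 2)}
Q2, I2, F2, delta2 = standardize_homogenize(I, F, delta)
words = ["".join(w) for n in range(4) for w in product("ab", repeat=n)]
same_language = all(accepts(I, F, delta, w) == accepts(I2, F2, delta2, w) for w in words)
```

The construction never adds a transition between copies that was not present between the original states, so an acyclic input stays acyclic.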


5.2 Non-acyclic case 


Let $A = (\Sigma, Q, I, F, \delta)$ be an automaton. The subset $O$ of $Q$ is said to be a suborbit if for every
couple $(q, q')$ in $O^2$, there exists a word $w$ in $\Sigma^+$ such that $q' \in \delta(q, w)$. The suborbit $O$ is said
to be an orbit if it is not included into another suborbit. The set $\{q \in O \mid \exists q' \in Q \setminus O, \exists a \in \Sigma, q \in \delta(q', a)\}$
is denoted by $\mathrm{In}(O)$. The set $\{q \in O \mid \exists q' \in Q \setminus O, \exists a \in \Sigma, q' \in \delta(q, a)\}$
is denoted by $\mathrm{Out}(O)$. Let $o \in O$. The set $\{q \in Q \setminus O \mid \exists a \in \Sigma, q \in \delta(o, a)\}$ is denoted
by $O^+(o)$. The set $\{q \in Q \setminus O \mid \exists a \in \Sigma, o \in \delta(q, a)\}$ is denoted by $O^-(o)$. An orbit is
said to be transverse if for every couple $(o, o')$ in $\mathrm{Out}(O) \times \mathrm{Out}(O)$, $O^+(o) = O^+(o')$, and if
for every couple $(o, o')$ in $\mathrm{In}(O) \times \mathrm{In}(O)$, $O^-(o) = O^-(o')$. An orbit is said to be stable if
for every couple $(i, o)$ in $\mathrm{In}(O) \times \mathrm{Out}(O)$, there exists $a$ in $\Sigma$ such that $i \in \delta(o, a)$. An orbit
is said to be strongly transverse if it is transverse and if after elimination of the transitions
from $\mathrm{Out}(O)$ to $\mathrm{In}(O)$, all the orbits remaining are strongly transverse. An orbit is said to be
strongly stable if it is stable and if after elimination of the transitions from $\mathrm{Out}(O)$ to $\mathrm{In}(O)$,
all the orbits remaining are strongly stable. Let $A$ be an automaton the orbits of which are
strongly transverse and strongly stable.
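The non-recursive notions above (orbit, In, Out, stable, transverse) can be computed directly from the transition set. The sketch below follows the definitions as stated in this section, so Out(O) only looks at transitions leaving the orbit; the helper names and the two sample automata are illustrative. The strong variants would additionally remove the transitions from Out(O) to In(O) and recurse, which is not done here.

```python
def reach(delta):
    """Map each state to the set of states reachable by a non-empty path."""
    states = {p for (p, _, _) in delta} | {q for (_, _, q) in delta}
    succ = {s: {q for (p, _, q) in delta if p == s} for s in states}
    changed = True
    while changed:  # naive transitive closure, fine for small examples
        changed = False
        for s in states:
            new = set().union(*(succ[t] for t in succ[s])) - succ[s]
            if new:
                succ[s] |= new
                changed = True
    return succ


def orbits(delta):
    """Maximal sets of states that are pairwise connected by non-empty paths."""
    r = reach(delta)
    seen, result = set(), []
    for s in r:
        if s in r[s] and s not in seen:  # s lies on a cycle
            orb = frozenset(t for t in r[s] if s in r[t])
            seen |= orb
            result.append(orb)
    return result


def in_out(orbit, delta):
    ins = {q for (p, _, q) in delta if q in orbit and p not in orbit}
    outs = {p for (p, _, q) in delta if p in orbit and q not in orbit}
    return ins, outs


def is_stable(orbit, delta):
    ins, outs = in_out(orbit, delta)
    edges = {(p, q) for (p, _, q) in delta}
    return all((o, i) in edges for o in outs for i in ins)


def is_transverse(orbit, delta):
    ins, outs = in_out(orbit, delta)
    plus = {frozenset(q for (p, _, q) in delta if p == o and q not in orbit) for o in outs}
    minus = {frozenset(p for (p, _, q) in delta if q == i and p not in orbit) for i in ins}
    return len(plus) <= 1 and len(minus) <= 1


# Hypothetical automaton: 0 -> 1 -> 2 -> 1 (cycle) and 2 -> 3.
delta = {(0, "a", 1), (1, "b", 2), (2, "a", 1), (2, "c", 3)}
orb = orbits(delta)[0]
# Variant where the orbit is left from state 1 instead of state 2: no edge 1 -> 1.
delta_bad = {(0, "a", 1), (1, "b", 2), (2, "a", 1), (1, "c", 3)}
orb_bad = orbits(delta_bad)[0]
```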

We now present the characterization of the family of the extended Glushkov automata. 


Theorem 1 Let A be a standard and homogeneous NFA. The two following conditions are 
equivalent: 


(1) The NFA A is an extended Glushkov automaton. 
(2) Every orbit of A is strongly stable and strongly transverse. 


In order to prove Theorem 1, we first extend the inductive construction of the Glushkov
automaton of an expression [11] by showing how to inductively compute the Glushkov
automaton $G(E)$ of an EmtbRE $E = \simeq_{T;B}(E_{1,n})$ from the Glushkov automata
$G(E_1), \ldots, G(E_n)$. Then, we exhibit a special form for EmtbREs such that the Glushkov
automaton of any EmtbRE is isomorphic to the one of its special form. We also show that
every orbit of $G(E)$ is strongly stable and strongly transverse. Finally, we show that an
automaton the orbits of which are strongly stable and strongly transverse is isomorphic to
the Glushkov automaton of an EmtbRE in special form.


Lemma 7 ([11]) Let $E$ be a regular expression and $G = (\Sigma, Q, \{i\}, F, \delta)$ be its Glushkov
automaton. Let $G' = (\Sigma, Q, \{i\}, F \cup \{i\}, \delta \cup \{(p, a, q) \mid p \in F \wedge (i, a, q) \in \delta\})$. Then $G'$
is the Glushkov automaton of $E^*$.
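The construction of Lemma 7 is short enough to sketch directly: the initial state becomes final, and every final state receives a copy of the transitions leaving the initial state. The representation (a single initial state, a set of finals, a set of triples) and the names below are assumptions made for illustration.

```python
def star(initial, finals, delta):
    """Glushkov automaton of E* built from the Glushkov automaton of E."""
    # every final state p gets the transitions (i, a, q) of the initial state
    loops = {(p, a, q) for p in finals for (s, a, q) in delta if s == initial}
    return initial, set(finals) | {initial}, set(delta) | loops


def accepts(initial, finals, delta, word):
    """Subset simulation of an NFA without epsilon-transitions."""
    current = {initial}
    for a in word:
        current = {q for (p, b, q) in delta if p in current and b == a}
    return bool(current & finals)


# Glushkov automaton of the expression a.b: 0 --a--> 1 --b--> 2, with 2 final.
i, F, delta = 0, {2}, {(0, "a", 1), (1, "b", 2)}
i_s, F_s, delta_s = star(i, F, delta)
```

No state is added, so the automaton of E* keeps the n + 1 states of the automaton of E.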


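Definition 17 Let $A_k = (\Sigma_k, Q_k, \{i_k\}, F_k, \delta_k)$, $1 \le k \le n$, be standard and homogeneous
automata. Let $T$ and $B$ be two lists in $S_n$ such that $T \uplus B = [1, n]^2_{\le}$. We denote by
$\simeq_{T;B}(A_1, \ldots, A_n)$ the automaton $(\Sigma, Q, I, F, \delta)$ defined by:


- $\Sigma = \bigcup_{k \in [1,n]} \Sigma_k$,

- $Q = \bigcup_{k \in [1,n]} Q_k \setminus \bigcup_{k \in [2,n]} \{i_k\}$,

- $I = \{i_1\}$,

- $F = F_n \cup \bigcup_{k \in [1,n-1] \wedge (k+1,n) \in T} F_k$ if $(1, n) \notin T$, and
  $F = F_n \cup \big(\bigcup_{k \in [1,n-1] \wedge (k+1,n) \in T} F_k\big) \cup \{i_1\}$ otherwise,

- $\delta = \{(q, a, q') \in Q \times \Sigma \times Q \mid$
  $(q, a, q') \in \bigcup_{k \in [1,n]} \delta_k$
  $\vee\ (q = i_1 \wedge (i_k, a, q') \in \delta_k \wedge (1, k-1) \in T)$
  $\vee\ (q \in F_k \wedge ((i_{k+1}, a, q') \in \delta_{k+1} \vee ((i_{k'}, a, q') \in \delta_{k'} \wedge (k+1, k'-1) \in T)))\}$.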
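Lemma 8 Let $A_k = (\Sigma_k, Q_k, \{i_k\}, F_k, \delta_k)$, $1 \le k \le n$, be the Glushkov automaton of the
expression $E_k$. Let $T$ and $B$ be two lists in $S_n$ such that $T \uplus B = [1, n]^2_{\le}$. Then the automaton
$\simeq_{T;B}(A_1, \ldots, A_n)$ is the Glushkov automaton of $\simeq_{T;B}(E_{1,n})$.


Proof Let $A = \simeq_{T;B}(A_1, \ldots, A_n)$ and $E = \simeq_{T;B}(E_{1,n})$. From Definition 15 and Definition 17, it can be shown
that $A = (\Sigma = \mathrm{Pos}(E), Q = \mathrm{Pos}(E) \cup \{i\}, I = \{i\}, F, \delta)$ with:


- $F = \mathrm{Last}(E) \cup \{i\}$ if $\varepsilon \in L(E)$, and $F = \mathrm{Last}(E)$ otherwise,
- $\delta = \{(q, a, q') \in Q \times \Sigma \times Q \mid q' \in \mathrm{Follow}(q, E) \vee (q = i \wedge q' \in \mathrm{First}(E))\}$.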
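As an illustration of how an automaton is assembled from Glushkov functions, the sketch below builds the automaton from the values of Table 1 (Example 2), where positions 1, 2, 3 carry the symbols a, b, a. The function names, the choice of 0 for the initial state and the dictionary encoding are assumptions made for this example.

```python
def glushkov_automaton(symbol_of, first, last, follow, nullable):
    """Build (states, initial, finals, delta) from Glushkov functions.

    symbol_of maps each position to the symbol it carries in the original expression.
    """
    i = 0  # initial state, assumed not to be a position
    states = {i} | set(symbol_of)
    finals = set(last) | ({i} if nullable else set())
    delta = {(i, symbol_of[x], x) for x in first}
    delta |= {(x, symbol_of[y], y) for x in follow for y in follow[x]}
    return states, i, finals, delta


def accepts(initial, finals, delta, word):
    """Subset simulation of an NFA without epsilon-transitions."""
    current = {initial}
    for a in word:
        current = {q for (p, b, q) in delta if p in current and b == a}
    return bool(current & finals)


# Values of Table 1: Null = empty, First = Last = {1, 2, 3}, Follow(k) = {3}.
symbol_of = {1: "a", 2: "b", 3: "a"}
states, i, finals, delta = glushkov_automaton(
    symbol_of,
    first={1, 2, 3},
    last={1, 2, 3},
    follow={1: {3}, 2: {3}, 3: {3}},
    nullable=False,
)
```

With three positions the automaton has four states, in line with the n + 1 state count used throughout the paper, and every transition entering a position carries that position's symbol, so the result is homogeneous.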


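From now on, we consider saturated EmtbREs in a special form, inductively obtained with
respect to the following rules:

(1) $E + F \rightarrow \simeq_{(1,2),(2,3);(1,1),(1,3),(2,2),(3,3)}(E, \emptyset, F)$

(2) $E \cdot F \rightarrow$
    $\simeq_{\emptyset;(1,1),(2,2),(1,2)}(E, F)$ if $\varepsilon \notin L(E) \cup L(F)$,
    $\simeq_{(1,1);(2,2),(1,2)}(E, F)$ if $\varepsilon \in L(E) \wedge \varepsilon \notin L(F)$,
    $\simeq_{(2,2);(1,1),(1,2)}(E, F)$ if $\varepsilon \notin L(E) \wedge \varepsilon \in L(F)$,
    $\simeq_{(1,1),(2,2),(1,2);\emptyset}(E, F)$ otherwise,

(3) $(E^*)^* \rightarrow E^*$

(4) if $(1, k-1) \in T$ and $(k+1, n) \in T$, then:

    $(\simeq_{T;B}(E_{1,k-1}, (E_k)^*, E_{k+1,n}))^* \rightarrow (\simeq_{T;B}(E_{1,k-1}, E_k, E_{k+1,n}))^*$

(5) if $(1, n) \in B$, then:

    $(\simeq_{T;B}(E_{1,n}))^* \rightarrow (\simeq_{T \cup \{(1,n)\};B \setminus \{(1,n)\}}(E_{1,n}))^*$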


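(6) $\simeq_{T';B'}(E'_{1,k-1}, \simeq_{T'';B''}(E''_{1,m}), E'_{k+1,n'}) \rightarrow \simeq_{T_1 \cup T_2 \cup T_3;B_1 \cup B_2 \cup B_3}(F_{1,m+n'-1})$

with:

    $T_1 = \{(i, f) \in T' \mid f < k\}$,
    $T_2 = \{(i+k-1, f+k-1) \mid (i, f) \in T''\}$ if $(k, k) \notin T' \cup B'$,
    $T_2 = \{(i+k-1, f+k-1) \mid (i, f) \in T''\} \setminus \{(k, k+m-1)\}$ otherwise,
    $T_3 = \{(i, f+m-1) \mid (i, f) \in T' \wedge i \le k \le f\} \cup \{(i+m-1, f+m-1) \mid (i, f) \in T' \wedge k < i\}$,
    $B_1 = \{(i, f) \in B' \mid f < k\}$,
    $B_2 = \{(i+k-1, f+k-1) \mid (i, f) \in B''\}$ if $(k, k) \notin T' \cup B'$,
    $B_2 = \{(i+k-1, f+k-1) \mid (i, f) \in B''\} \setminus \{(k, k+m-1)\}$ otherwise,
    $B_3 = \{(i, f+m-1) \mid (i, f) \in B' \wedge i \le k \le f\} \cup \{(i+m-1, f+m-1) \mid (i, f) \in B' \wedge k < i\}$.

For every integer $l$ in $[1, n'+m-1]$: $F_l = E'_l$ if $l < k$, $F_l = E''_{l-k+1}$ if $k \le l \le k+m-1$,
and $F_l = E'_{l-m+1}$ otherwise.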


Lemma 9 Let E be an EmtbRE and E' be its special form. The Glushkov automata of E 
and E' are isomorphic. 


Proof It can be shown that transformations (1)-(6) preserve Glushkov functions. $\square$


Lemma 10 Let $E = F^*$ be an EmtbRE in special form. Let us consider the orbit $O$ such that
$G(E) = (\Sigma, \{i\} \cup O, \{i\}, F, \delta)$. The automaton obtained after removing transitions from
$\mathrm{Out}(O)$ to $\mathrm{In}(O)$ is $G(F)$.


Proof Suppose that the automaton obtained after removing the transitions from $\mathrm{Out}(O)$ to
$\mathrm{In}(O)$ in $G(E)$ is different from $G(F)$. It implies that there exists a transition $(x, a, y)$ from
$\mathrm{Out}(O)$ to $\mathrm{In}(O)$ in $G(E)$ which is also in $G(F)$, i.e. there exist $x \in \mathrm{Last}(F)$, $y \in \mathrm{First}(F)$
such that $y \in \mathrm{Follow}(x, F)$. Since $F$ cannot be a starred expression (rule (3)), there exists
an integer $k$ such that $F = \simeq_{T;B}(F_{1,n})$, $x \in \mathrm{Last}(F_k)$, and $y \in \mathrm{Follow}(x, F_k)$. As a
consequence, $F_k$ is a starred EmtbRE. Moreover, since $x \in \mathrm{Last}(F)$ and $y \in \mathrm{First}(F)$ (otherwise
there is a contradiction with $(x, a, y)$ going from $\mathrm{Out}(O)$ to $\mathrm{In}(O)$), $(1, k-1) \in T$ and $(k+1, n) \in T$.
This contradicts the fact that $E$ is in special form (rule (4)).


Lemma 11 Every orbit O of the Glushkov automaton G of an EmtbRE E satisfies the two 
following properties: 


(P1) O is strongly stable, 
(P2) O is strongly transverse. 


Proof According to Lemma 9, the Glushkov automaton of an EmtbRE is isomorphic to the
one of its special form $E'$. Let us show that every orbit in $G(E')$ satisfies (P1) and (P2) by
induction over the structure of $E'$.

If $E' = a$ or $E' = \emptyset$, (P1) and (P2) are satisfied for every orbit of $G$.

If $E' = \simeq_{T;B}(F_{1,n})$, according to Definition 17, every orbit $O$ of $G$ is an orbit of
the Glushkov automaton of an expression $F_k$ for $k \in [1, n]$. As a consequence, $O$ is
strongly stable. Moreover, any transition added during the computation of $G(E')$ from
$G(F_1), \ldots, G(F_n)$ is a transition from a former final state to a successor of an initial state.
Since this construction is realized from all of the final states to all of the successors of a
former initial state, $O$ remains strongly transverse during the computation of $G$.

If $E' = (F)^*$, according to Lemma 7, $G$ is composed of an initial state $i$ linked to a unique
orbit $O$, the final states of which are all linked to all of the successors of $i$. As a consequence,
$O$ is stable and transverse. According to Lemma 10, removing the transitions from $\mathrm{Out}(O)$ to
$\mathrm{In}(O)$ results in the Glushkov automaton of $F$, the orbits of which satisfy (P1) and (P2) by
induction.

In all of these cases, the lemma holds.


The converse part of Theorem 1 is based on the following lemma:


Lemma 12 Let $A = (\Sigma, \{i\} \cup O, \{i\}, F \cup \{i\}, \delta)$ be a standard, homogeneous and accessible
automaton with an initial state linked to a single orbit $O$ that is strongly stable and strongly
transverse. Let $A'$ be the automaton obtained by removing the transitions from $\mathrm{Out}(O)$ to
$\mathrm{In}(O)$. Then:


$L(A) = L(A')^*$.


Proof Let $A = (\Sigma, Q, \{i\}, F, \delta)$. Since $i$ is final, $\varepsilon \in L(A)$. By definition of $A$, every
word $w$ in $L(A) \setminus \{\varepsilon\}$ can be split up into $s_w = (a_1, w_1, \ldots, a_n, w_n)$ where for every $k$ in
$[1, n]$, (a) the word $a_1 \cdot w_1 \cdots a_k \cdot w_k$ labels a path from $i$ to $\mathrm{Out}(O)$ and (b) the word
$a_1 \cdot w_1 \cdots a_k$ labels a path from $i$ to $\mathrm{In}(O)$. Since several split ups of $w$ may exist, consider
the largest one: if a subword $w_k$ of $w$ can be split up into $w'_k \cdot a'_k \cdot w''_k$, then the sequence
$s'_w = (a_1, w_1, \ldots, a_k, w'_k, a'_k, w''_k, a_{k+1}, \ldots, w_n)$ still satisfies properties (a) and
(b). Either $L(A) \setminus \{\varepsilon\}$ is empty and $L(A')$ also is, or there exists a final state in $O$. Consequently,
since $O$ is strongly transverse, $F = \mathrm{Out}(O)$ and then the words $a_1 \cdot w_1 \cdots a_k \cdot w_k$
for every $k$ in $[1, n]$ are in $L(A)$. Since $A$ is homogeneous, every state in $\mathrm{In}(O)$ is a successor
of $i$. Consequently, every word $a_k \cdot w_k$ for $k$ in $[1, n]$ is the label of a path $p$ from $i$ to a
state in $F$ and since $n$ is maximal, there exists a path labelled by $a_k \cdot w_k$ from $i$ to $F$ not
using edges from $\mathrm{Out}(O)$ to $\mathrm{In}(O)$. This path $p$ still exists in $A'$ after removing the edges
from $\mathrm{Out}(O)$ to $\mathrm{In}(O)$ in $A$. Finally, since any successful path in $A'$ is a successful path in
$A$, $L(A) = L(A')^*$.

Lemma 13 Let $A$ be a standard and homogeneous automaton the orbits of which are strongly
stable and strongly transverse. There exists an EmtbRE $E$ such that $G(E)$ and $A$ are isomorphic.


Proof By induction on the structure of $A$.

(1) If $A$ is an acyclic automaton, according to Proposition 5, $A$ is a Glushkov automaton.

(2) If $A$ is composed of an initial and final state $i$ linked to a unique orbit $O$, then according
to Lemma 12, the automaton $A'$ obtained by removing transitions from $\mathrm{Out}(O)$ to $\mathrm{In}(O)$
recognizes a language $L$ such that $L^* = L(A)$. Since by definition of the strong stability
and the strong transversity any orbit of $A'$ is strongly stable and strongly transverse, $A'$ is a
Glushkov automaton the recognized language of which is denoted by an EmtbRE $F$. As a
consequence, the EmtbRE $F^*$ denotes the language $L(A)$.

(3) If $A$ is composed of an initial and non-final state $i$ linked to a unique orbit $O$, then the
language $L(A)$ is equal to $L(A') \setminus \{\varepsilon\}$ where $A'$ is the automaton obtained after turning final the
initial state $i$. From (2), there exists an EmtbRE $F$ such that $L(A) = L(F^*) \setminus \{\varepsilon\}$.

(4) If $A = (\Sigma, Q, \{i\}, F, \delta)$ is composed of several strongly stable and strongly transverse
orbits $O_1, \ldots, O_k$, let us consider for any integer $l$ in $[1, k]$ the automaton
$A_l = (\Sigma, O_l \cup \{i_l\}, \{i_l\}, \mathrm{Out}(O_l), \delta_l)$ with
$\delta_l = (\delta \cap (O_l \times \Sigma \times O_l)) \cup \bigcup_{q \in \mathrm{In}(O_l)} \{(i_l, a_q, q)\}$ and $a_q$ the symbol
labelling transitions entering a state $q$. Each one of the automata $A_l$ is by induction
a Glushkov automaton, since by construction each one of its orbits is strongly stable and
strongly transverse (otherwise, there is a contradiction with the strong stability and strong transversity of
every orbit of $A$) and there exists an expression $E_l$ denoting $L(A_l)$. Let us consider the sets
$O = \bigcup_{l \in [1,k]} O_l$ and $S = \{q \in Q \mid q \notin O\} \cup \bigcup_{l \in [1,k]} \{O_l\}$. Any element $e$ in $S$ is a
strongly connected component of $A$ (not necessarily an orbit). As a consequence, consider
that the set $S = \{e_1, \ldots, e_{\#S}\}$ is ordered according to a topological order. The set $S$ can be
extended to $S'$ by adding the set $\{e'_k = \emptyset \mid$ there is no path from $e_k$ to $e_{k+1}\}$. Consider the order
$e_k < e'_k < e_{k+1}$ and let $S' = \{f_1, \ldots, f_m\}$. Finally consider the list $\mathcal{E} = (F_1, \ldots, F_m)$ of
EmtbREs where $F_k = \emptyset$ if $f_k = \emptyset$, $F_k = E_l$ if $f_k = e_l$. Let us show now that there exist two
lists $T$ and $B$ in $S_{m-1}$ such that $\simeq_{T;B}(G(F_1), \ldots, G(F_m))$ is isomorphic to $G(E)$. In fact,
these two lists are easy to compute:


T = {(i, f),€ [l,m — Mz | (ei-1} x © x fef) N ê 4 Ø}, 
B = |1, m- 1ÊAT. 


According to Definition 17, $\simeq_{T;B}(G(F_1), \ldots, G(F_m)) = A$ and finally $A$ is isomorphic
to the Glushkov automaton of the expression $\simeq_{T;B}(\mathcal{E})$.


Finally, the proof of Theorem 1 is obtained from Lemmas 11 and 13.


Example 3 Let us consider the automaton $A$ of Fig. 2. This automaton is composed of an
initial and final state 0 and an orbit $O_1 = \{1, 2, 3, 4, 5, 6\}$ that is stable and transverse. The
language $L(A)$ equals $(L(A_1))^*$ where $A_1$ is the automaton represented in Fig. 3, obtained by
removing the only transition from $\mathrm{Out}(O_1)$ to $\mathrm{In}(O_1)$, the transition $(6, a, 1)$. The automaton
$A_1$ is composed of four strongly connected components, $\{\{0\}, \{1\}, O_2, \{6\}\}$ with $O_2$ the
orbit $\{2, 3, 4, 5\}$. As a consequence, $L(A_1)$ is denoted by $\simeq_{(1,3);[1,3]^2_{\le} \setminus \{(1,3)\}}(a, E_2, f)$
where $E_2$ denotes the language recognized by the automaton $A_2$ represented in Fig. 4. The
automaton $A_2$ is composed of an initial and non-final state $i_2$ linked to a stable and transverse
orbit $O_2$. As a consequence, $L(A_2)$ is denoted by $(E_3)^*$ with $E_3$ denoting the language of
$A_3$ represented in Fig. 5, which is acyclic. The automaton $A_3$ is the Glushkov automaton of
the expression $\simeq_{T=\{(2,2),(3,3)\};[1,4]^2_{\le} \setminus T}(b, c, d, e)$. As a consequence, $A$ is the Glushkov
automaton of the EmtbRE

$E = (\simeq_{(1,3);[1,3]^2_{\le} \setminus \{(1,3)\}}(a, (\simeq_{T=\{(2,2),(3,3)\};[1,4]^2_{\le} \setminus T}(b, c, d, e))^*, f))^*$.

Fig. 2 The automaton A


6 The succinctness of extended regular expressions 


In this section, we first investigate the power of factorization of the EmtbREs by considering 
as a witness the family of finite languages introduced by Ehrenfeucht and Zeiger [13]. We 
also provide a concrete example of the succinctness of the EmtbREs by considering a family 
of languages studied in [20]. 


6.1 A superpolynomial factorization 


Let us consider the family of finite languages $H = \{H_k \mid k \ge 1\}$ introduced by Ehrenfeucht
and Zeiger [13]. For $k \ge 1$, let $G_k = (U_k, V_k)$ be the directed graph defined by:


$U_k = [1, k]$ and $V_k = [1, k]^2_{<} = \{(i, f) \in [1, k]^2 \mid i < f\}$.

The language $H_k$ is the set of paths of the graph $G_k$ going from 1 to $k$.

Proposition 6 Let $k \ge 1$ and $A_k = (\Sigma, Q, I, F, \delta)$ be the NFA defined by:


- $\Sigma = [1, k]^2_{<}$,
- $Q = [1, k]$,
- $I = \{1\}$,
- $F = \{k\}$,
- $\delta = \{(p, a, q) \in Q \times \Sigma \times Q \mid a = (p, q)\}$.

Then $L(A_k) = H_k$.

Fig. 3 The automaton $A_1$

Fig. 4 The automaton $A_2$

Fig. 6 The NFA $A_4$ recognizing $H_4$

Fig. 7 The NFA $A_5$ recognizing $H_5$


Proof By construction, a word $w$ is in $H_k$ if and only if $w$ is the label of a successful path
in $A_k$. $\square$


Example 4 Let us consider the languages $H_4$ and $H_5$. The language $H_4$ is recognized by the
4-state NFA $A_4$ of Fig. 6. The language $H_5$ is recognized by the 5-state NFA $A_5$ of Fig. 7.
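The automaton of Proposition 6 and the languages of Example 4 are small enough to enumerate. The following sketch builds A_k with edges (p, q) for p < q as symbols and lists all accepted words up to length k - 1; the function names are illustrative.

```python
def ez_automaton(k):
    """NFA A_k of Proposition 6: states 1..k, one transition per edge (p, q), p < q."""
    states = set(range(1, k + 1))
    delta = {(p, (p, q), q) for p in states for q in states if p < q}
    return states, {1}, {k}, delta


def language(initials, finals, delta, max_len):
    """All accepted words of length <= max_len, as tuples of symbols."""
    words = set()
    frontier = {(q, ()) for q in initials}
    for _ in range(max_len + 1):
        words |= {w for (q, w) in frontier if q in finals}
        frontier = {(q2, w + (a,))
                    for (q, w) in frontier
                    for (p, a, q2) in delta if p == q}
    return words


_, I4, F4, d4 = ez_automaton(4)
H4 = language(I4, F4, d4, 3)   # a path from 1 to 4 uses at most 3 edges
_, I5, F5, d5 = ez_automaton(5)
H5 = language(I5, F5, d5, 4)
```

Each word of H_k corresponds to a choice of intermediate vertices among 2..k-1, which is why the enumeration yields 2^(k-2) words for k >= 2.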


Proposition 7 (Ehrenfeucht and Zeiger [13]) Let $H_k$ be a language in $H$. For any simple
regular expression $E'_k$ denoting $H_k$ it holds: $|E'_k| = k^{\Omega(\log \log k)}$.


Let us note that the lower bound of Proposition 7 has been recently sharpened by Gruber
and Johannsen [16], who use a binary alphabet and obtain a tight bound of $k^{\Theta(\log k)}$.

According to Corollary 1, $A_k$ being an acyclic $k$-state automaton over the alphabet
$\Sigma = [1, k]^2_{<}$, there exists an EmtbRE $E_k$ such that $L(E_k) = L(A_k)$ and $|E_k| = O((\#\Sigma) \times k)$.
Since $(\#\Sigma) = \frac{k(k-1)}{2}$, we get $|E_k| = O(k^3)$. This is sufficient to conclude that for all
$k \ge 1$, there exists an EmtbRE denoting $H_k$ that is superpolynomially more succinct than any
equivalent simple regular expression. Actually, we can show that $|E_k|$ is
$O(k^2)$ by constructing the standard and homogeneous automaton associated with $A_k$.


Proposition 8 Let $k \ge 1$ and $A'_k = (\Sigma, Q', I', F', \delta')$ be the NFA defined by:


- $\Sigma = [1, k]^2_{<}$,
- $Q' = \{1\} \cup \{(f, i) \mid f \in Q \wedge 1 \le i < f\}$,
- $I' = \{1\}$,
- $F' = \{(f, i) \in Q' \mid f = k\}$,
- $\delta' = \{(p, (i, f), (f, i)) \in Q' \times \Sigma \times Q' \mid p = (i, j)$ for some $j$, or $p = 1$ and $i = 1\}$.

Then the NFA $A'_k$ is standard, homogeneous and $L(A'_k) = H_k$.


Proof By construction, $A'_k$ is a standard and homogeneous automaton. Let us show that
$L(A'_k) = L(A_k)$, where $A_k$ is the automaton computed in Proposition 6. Let us consider a
state $f$, $f \ge 2$, of the automaton $A_k$. For all $1 \le i < f$, there exists a transition labelled by
$(i, f)$ and entering $f$. The state $f$ is duplicated into $f - 1$ states $(f, 1), \ldots, (f, f-1)$
so that transitions that enter $(f, i)$ have the same $(i, f)$ label. Conversely, the state $i$ being
duplicated into $i - 1$ states $(i, 1), \ldots, (i, i-1)$, for all $t < i$, a transition $((i, t), (i, f), (f, i))$
is added to $\delta'$. Consequently, we have $L(A'_k) = L(A_k)$.


Since $(\#Q') = 1 + \frac{k(k-1)}{2}$, we can conclude that there exists an EmtbRE $E_k$ denoting $H_k$
such that $|E_k| = \frac{k(k-1)}{2}$. Finally we can state the following theorem:
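The duplication argument of the proof can be replayed on small cases. The sketch below builds the homogeneous automaton for k >= 2, adding explicitly the transitions that leave the initial state 1 (which has no duplicates), and compares its language with the paths from 1 to k enumerated independently; all function names are illustrative.

```python
from itertools import combinations


def ez_homogeneous(k):
    """Standard and homogeneous NFA for H_k (k >= 2): state (f, i) is entered only by (i, f)."""
    init = 1
    dup = {(f, i) for f in range(2, k + 1) for i in range(1, f)}
    finals = {(f, i) for (f, i) in dup if f == k}
    # transitions leaving the initial state 1
    delta = {(init, (1, f), (f, 1)) for f in range(2, k + 1)}
    # every copy (i, t) of state i keeps the outgoing edges of i
    delta |= {((i, t), (i, f), (f, i)) for (i, t) in dup for f in range(i + 1, k + 1)}
    return {init} | dup, {init}, finals, delta


def language(initials, finals, delta, max_len):
    """All accepted words of length <= max_len, as tuples of symbols."""
    words, frontier = set(), {(q, ()) for q in initials}
    for _ in range(max_len + 1):
        words |= {w for (q, w) in frontier if q in finals}
        frontier = {(q2, w + (a,)) for (q, w) in frontier
                    for (p, a, q2) in delta if p == q}
    return words


def paths(k):
    """Paths from 1 to k in the graph G_k, each written as a tuple of edges."""
    inner, result = range(2, k), set()
    for r in range(len(inner) + 1):
        for mid in combinations(inner, r):
            nodes = (1, *mid, k)
            result.add(tuple(zip(nodes, nodes[1:])))
    return result


Q4, I4, F4, d4 = ez_homogeneous(4)
Q5, I5, F5, d5 = ez_homogeneous(5)
```

The number of states is 1 + k(k-1)/2, so removing the initial state leaves k(k-1)/2 positions, which matches the width announced for the EmtbRE.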


Theorem 2 Let $H_k$ be a language in $H$. The three following conditions are satisfied:


(1) The language $H_k$ is recognized by a $k$-state NFA.
(2) For every regular expression $E'_k$, it holds: $L(E'_k) = H_k \Rightarrow |E'_k| = k^{\Omega(\log \log k)}$.
(3) There exists an EmtbRE $E_k$ denoting $H_k$ the width of which is $\frac{k(k-1)}{2}$.


6.2 A concrete example of succinctness 


Following [20], let us consider the family of finite languages $(L_k)_{k \ge 3}$, where for any $k \ge 3$,
$L_k$ is defined over the alphabet $\Sigma_k = \{a_1, \ldots, a_k\}$ by:


$L_k = \{w_1 \cdots w_k \mid \forall j \in [1, k],\ w_j \in \{\varepsilon, a_j\} \wedge \forall j \in [1, k-1],\ w_j w_{j+1} \ne \varepsilon\}$.


On the one hand, the language $L_k$ is denoted by the EmtbRE


$E_k = \simeq_{T_k;B_k}(a_1, \ldots, a_k)$


with $T_k = \{(i, i) \mid i \in [1, k]\}$ and $B_k = \{(i, i+1) \mid i \in [1, k-1]\}$.

Notice that the alphabetical width of $E_k$ is exactly $k$ and that $\#(T_k \cup B_k) = 2k - 1$.

On the other hand, the regular language $L_k$ is denoted by a regular expression $E'_k$ that can
be inductively constructed according to an algorithm proposed in [20]. We recall here this
construction; the case where $k = 6$ is presented in Example 5.


Example 5 The language $L_6$ is denoted by the EmtbRE


$E_6 = \simeq_{T_6;B_6}(a_1, a_2, a_3, a_4, a_5, a_6)$


and by the regular expression

$E'_6 = a_1 a_3 (a_4 a_6 + (\varepsilon + a_4) a_5 (\varepsilon + a_6))$
$\quad + (\varepsilon + a_1) a_2 (a_3 a_5 (\varepsilon + a_6) + (a_3 + \varepsilon) a_4 (a_6 + a_5 (\varepsilon + a_6)))$.
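The equivalence claimed in Example 5 can be checked by brute force. In the sketch below the symbols a_1, ..., a_6 are encoded as the letters a, ..., f (an encoding chosen for this illustration), the language L_k is enumerated from its definition, and the regular expression of Example 5 is transcribed into Python regex syntax, with `x?` standing for (ε + x).

```python
import re
from itertools import product


def language_L(k):
    """Words w_1...w_k with w_j in {eps, a_j} and no two consecutive empty factors."""
    letters = [chr(ord("a") + j) for j in range(k)]  # a_1..a_k encoded as 'a', 'b', ...
    words = set()
    for keep in product([False, True], repeat=k):
        if all(keep[j] or keep[j + 1] for j in range(k - 1)):
            words.add("".join(l for l, kept in zip(letters, keep) if kept))
    return words


# Transcription of the regular expression of Example 5.
E6_PRIME = re.compile(r"ac(df|d?ef?)|a?b(cef?|c?d(f|ef?))")

L6 = language_L(6)
# Every word the expression can produce is a subsequence of "abcdef", so it is
# enough to test all 64 subsequences.
candidates = {"".join(l for l, kept in zip("abcdef", keep) if kept)
              for keep in product([False, True], repeat=6)}
matched = {w for w in candidates if E6_PRIME.fullmatch(w)}
```

The check confirms on this instance that both expressions denote the same 21 words, while the EmtbRE uses each symbol once and the regular expression uses 17 symbol occurrences.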
We say that a subexpression of $E'_k$ is:


- a type-1 one if it is of the form $G \cdot a_k$,
- a type-2 one if it is of the form $G \cdot a_{k-1}(a_k + \varepsilon)$,
- a type-3 one if it is of the form $G \cdot (a_{k-1}(a_k + \varepsilon) + a_k)$,


where $G$ is an expression of width 1.

Starting from $E'_3 = a_1 a_3 + (\varepsilon + a_1) a_2 (\varepsilon + a_3)$, $E'_{k+1}$ can be computed from $E'_k$
by performing the following substitutions. Any type-1 subexpression $G \cdot a_k$ is replaced
by $G \cdot a_k (a_{k+1} + \varepsilon)$ (type-1 substitution). Any type-2 subexpression $G \cdot a_{k-1}(a_k + \varepsilon)$ is
replaced by $G \cdot a_{k-1}(a_k(a_{k+1} + \varepsilon) + a_{k+1})$ (type-2 substitution). Any type-3 subexpression
$G \cdot (a_{k-1}(a_k + \varepsilon) + a_k)$ is replaced by $G \cdot (a_{k-1} a_{k+1} + (\varepsilon + a_{k-1}) a_k (\varepsilon + a_{k+1}))$ (type-3
substitution).

For $1 \le i \le 3$, let us define the integer sequence $t_i$ by:


$t_i(k)$ = the number of type-$i$ subexpressions of $E'_k$.


We also define the integer sequence $t$ by: $t(k) = |E'_k|$.

Notice that a type-1 substitution generates a type-2 subexpression and adds one symbol,
a type-2 substitution generates a type-3 subexpression and adds two symbols and finally a
type-3 substitution generates a type-1 subexpression and a type-2 subexpression and adds
two symbols.


Consequently,

$t_1(k+1) = 1$ if $k = 2$, $t_3(k)$ otherwise,
$t_2(k+1) = 1$ if $k = 2$, $t_1(k) + t_3(k)$ otherwise,
$t_3(k+1) = 0$ if $k = 2$, $t_2(k)$ otherwise,
$t(k+1) = 5$ if $k = 2$, $t(k) + t_1(k) + 2 t_2(k) + 2 t_3(k)$ otherwise.


For any integer $k$ greater than 5, the equalities $t_3(k) = t_2(k-1)$ and $t_1(k) = t_2(k-2)$
hold and hence $t_2(k) = t_2(k-2) + t_2(k-3)$. This recursive formula is the definition of the
Padovan sequence $P$ (sequence A000931 of OEIS) with a small index shift for initial values;
here we will consider that $P(0) = P(1) = P(2) = 1$. As a consequence, the sequence $t$
can be computed as follows: $t(5) = 12$ and for all $k \ge 5$, $t(k+1) = t(k) + P(k-5) +
2P(k-3) + 2P(k-4)$. Moreover, the Padovan sequence grows as $\rho^k$ with $\rho$ the only
real solution of the equation $x^3 - x - 1 = 0$ ($\rho \approx 1.3247$). Hence, the width of the
expression $E'_k$, that is equal to $t(k)$, grows exponentially with $k$.

As a result, the two equivalent expressions $E_k$ and $E'_k$ are such that for $k \ge 5$ the width
of $E_k$ grows linearly with $k$ while the width of $E'_k$ grows exponentially with $k$. Since the
width of the regular expression $E'_k$ is not proved to be minimal, no theoretical result can be
stated from the family $(L_k)_{k \ge 3}$. On the other hand, this family provides a practical illustration
of the succinctness of the EmtbREs, especially since the construction of $E'_{k+1}$ from $E'_k$ aims
to reduce the width of $E'_{k+1}$. Let us remark that the width of any regular expression denoting
$L_k$ is greater than or equal to the width of $E_k$ since any symbol of $\Sigma_k$ occurs only once in $E_k$.

Table 2 The Padovan sequence

k      0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16
P(k)   1  1  1  2  2  3  4  5  7  9  12  16  21  28  37  49  65

Table 3 Comparison between s(E_k) and |E'_k|

k        5   6   7   8   9   10  11   12   13   14   15   16
|E'_k|   12  19  28  40  56  77  105  142  191  256  342  456
s(E_k)   23  28  33  38  43  48  53   58   63   68   73   78


We now show that a similar result is obtained if we take into account the number of bars and
tildes to measure the size of an extended expression. Since a bar or a tilde can be represented
by a pair of positions, the size of an EmtbRE $E$ can be defined by $s(E) = |E| + 2(\#(T \cup B))$.
In the general case, $2(\#(T \cup B))$ is at most quadratic in $|E|$. For the family of languages $(L_k)_{k \ge 3}$, it
holds $\#(T_k \cup B_k) = 2k - 1$. Consequently, $s(E_k) = 5k - 2$ and the size of $E_k$ grows
linearly with $k$.

Table 2 gives the values of the Padovan sequence, for $0 \le k \le 16$. Table 3 compares the
size $s(E_k)$ and the width $|E'_k|$, for $5 \le k \le 16$. It can be seen that for all $k \ge 8$, it holds
$s(E_k) < |E'_k|$.
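The Padovan values of Table 2 and the sizes s(E_k) of Table 3 follow from two one-line formulas, which the sketch below recomputes. The function names are illustrative; the indexing P(0) = P(1) = P(2) = 1 is the one stated in the text.

```python
def padovan(n_terms):
    """First n_terms values of P with P(0) = P(1) = P(2) = 1 and P(k) = P(k-2) + P(k-3)."""
    p = [1, 1, 1]
    while len(p) < n_terms:
        p.append(p[-2] + p[-3])
    return p[:n_terms]


def size_Ek(k):
    """s(E_k) = |E_k| + 2 * #(T_k U B_k), with k symbols, k tildes and k - 1 bars."""
    width = k
    tildes_and_bars = k + (k - 1)
    return width + 2 * tildes_and_bars


table2 = padovan(17)                          # P(0) .. P(16)
sizes = [size_Ek(k) for k in range(5, 17)]    # s(E_5) .. s(E_16)
ratio = table2[16] / table2[15]               # approaches the real root of x^3 - x - 1
```

The ratio of consecutive terms is already close to 1.3247 at k = 16, which illustrates the exponential growth of the Padovan sequence against the linear growth 5k - 2 of s(E_k).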


7 Conclusion 


A main interest of multi-tilde-bar operators is that they are compatible with the lineariza- 
tion. As a consequence, an associated extended expression with alphabetic width n can be 
converted into an (n + 1)-state automaton. Reciprocally, such an automaton can be trans- 
lated into an extended expression with alphabetic width n. Thus, multi-tilde-bar operators 
provide a way to extend the family of Glushkov automata. Moreover, an extended expression 
is generally more succinct than an equivalent standard one. The power of factorization of 
extended expressions is due to the fact that extended expressions are an intermediary model 
between standard expressions and automata. We next intend to investigate the case where 
the application of each tilde on a list of expressions is controlled by a Boolean variable. Several
questions are raised, among which: what is the succinctness of the associated extended
expressions? How can such an expression be converted into an automaton?


Acknowledgments We wish to thank H. Gruber who pointed out the superpolynomial factorization of 
multi-tilde-bar expression.